index prefetching
Hi,
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
Motivation
----------
Imagine we have a huge table (much larger than RAM), with an index, and
that we're doing a regular index scan (e.g. using a btree index). We
first walk the index to the leaf page, read the item pointers from the
leaf page and then start issuing fetches from the heap.
The index access is usually pretty cheap, because non-leaf pages are
very likely cached, so perhaps we do one I/O for the leaf page. But the
fetches from the heap are likely very expensive - unless the heap is
well correlated with the index, we'll do a random I/O for each item
pointer. That's easily ~200 or more I/O requests per leaf page. The
problem is that index scans do these
requests synchronously at the moment - we get the next TID, fetch the
heap page, process the tuple, continue to the next TID etc.
That is slow and can't really leverage the bandwidth of modern storage,
which requires longer queues. This patch aims to improve this by async
prefetching.
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly prefetching was implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there are many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
But there are three shortcomings in that logic:
1) It's not clear that the threshold at which prefetching becomes
beneficial is the same as the threshold for switching to bitmap index
scans. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.
2) Our estimates / planning are not perfect, so we may easily pick an
index scan instead of a bitmap scan. It'd be nice to limit the damage a
bit by still prefetching.
3) There are queries that can't do a bitmap scan (at all, or because
it's hopelessly inefficient). Consider queries that require ordering, or
queries by distance with GiST/SP-GiST index.
Implementation
--------------
When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.
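For reference, this is roughly the relevant part of BTScanPosData
(abridged from nbtree.h) - the items array already holds the heap TIDs
in the order the scan will return them:

    typedef struct BTScanPosItem    /* what we remember about each match */
    {
        ItemPointerData heapTid;        /* TID of referenced heap item */
        OffsetNumber    indexOffset;    /* index item's location within page */
        LocationIndex   tupleOffset;    /* IndexTuple's offset in workspace */
    } BTScanPosItem;

    typedef struct BTScanPosData
    {
        ...
        int     firstItem;      /* first valid index in items[] */
        int     lastItem;       /* last valid index in items[] */
        int     itemIndex;      /* current index in items[] */

        BTScanPosItem items[MaxTIDsPerBTreePage];   /* MUST BE LAST */
    } BTScanPosData;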
The easiest thing would be to just do prefetching from the btree code.
But then I realized there's no particular reason why other index types
(except for GIN, which only allows bitmap scans) couldn't do prefetching
too. We could have a copy in each AM, of course, but that seems sloppy
and also a violation of layering. After all, bitmap heap scans do
prefetch from the executor, so the AM seems way too low-level.
So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).
So what I did is introduce an IndexPrefetch struct, which is part of
IndexScanDesc and maintains all the info about prefetching for that
particular scan - current/maximum distance, progress, etc.
It also contains two AM-specific callbacks (get_range and get_block),
which return the valid range of indexes (into the AM's internal array)
and the block number for a given index.
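To give a rough idea (see genam.h in the attached patch for the full
definition), the struct and the callback signatures look about like this:

    typedef void (*prefetcher_getrange_function) (IndexScanDesc scan,
                                                  ScanDirection dir,
                                                  int *start, int *end,
                                                  bool *reset);

    typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scan,
                                                         ScanDirection dir,
                                                         int index);

    typedef struct IndexPrefetchData
    {
        int         prefetchIndex;      /* how far we already prefetched */
        int         prefetchTarget;     /* current prefetch distance */
        int         prefetchMaxTarget;  /* maximum prefetch distance */
        int         prefetchReset;      /* distance to reset to on rescan */

        /* small LRU cache of recently prefetched blocks */
        BlockNumber cacheBlocks[8];
        int         cacheIndex;

        /* AM-specific callbacks */
        prefetcher_getrange_function get_range;
        prefetcher_getblock_function get_block;
    } IndexPrefetchData;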
This mostly does the trick, although index_prefetch() is still called
from the amgettuple() functions. That seems wrong - we should call it
from indexam.c right after calling amgettuple.
Problems / Open questions
-------------------------
There are a couple of issues I ran into; I'll try to list them in order
of importance (most serious ones first).
1) pairing-heap in GiST / SP-GiST
For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is straightforward, even if the
current API is a bit cumbersome.
Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(
I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.
In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
I'm also not entirely sure the way this interfaces with the AM (through
the get_range / get_block callbacks) is very elegant. It did the trick,
but it seems a bit cumbersome. I wonder if someone has a better/nicer
idea how to do this ...
3) prefetch distance
I think we can do various smart things about the prefetch distance.
The current code does about the same thing bitmap scans do - it starts
with distance 0 (no prefetching), and then simply ramps the distance up
until the maximum value from get_tablespace_io_concurrency(), which is
either effective_io_concurrency or the per-tablespace value.
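Concretely, the forward-scan path in index_prefetch() boils down to
roughly this (leaving out the leaf-page reset handling and the small
dedup cache):

    IndexPrefetch prefetch = scan->xs_prefetch;
    int           startIndex,
                  endIndex;
    bool          reset;

    /* gradually increase the prefetch distance, up to the maximum */
    prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
                                   prefetch->prefetchMaxTarget);

    /* don't prefetch anything until the distance ramps up above zero */
    if (prefetch->prefetchTarget <= 0)
        return;

    /* range of not-yet-processed entries on the current leaf page */
    prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);

    /*
     * Prefetch only up to the target ahead of the current position, and
     * skip entries we already prefetched earlier.
     */
    endIndex = Min(endIndex, startIndex + prefetch->prefetchTarget);
    startIndex = Max(startIndex, prefetch->prefetchIndex + 1);

    for (int i = startIndex; i <= endIndex; i++)
        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
                       prefetch->get_block(scan, dir, i));

    prefetch->prefetchIndex = endIndex;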
I think we could be a bit smarter, and also consider e.g. the estimated
number of matching rows (but we shouldn't be too strict, because it's
just an estimate). We could also track some statistics for each scan and
use that during rescans (think index scan in a nested loop).
But the patch doesn't do any of that now.
4) per-leaf prefetching
The code currently only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to fully
process one leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
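For context, the per-leaf restriction is also what the reset handling is
about - the AM sets a flag when it loads a new batch of item pointers,
and the prefetcher uses that to start over. Roughly (from the patch):

    /* in _bt_readpage(), before loading items from the next leaf page */
    so->currPos.didReset = true;

    /* in index_prefetch(), after the get_range callback reports the reset */
    if (reset)
    {
        prefetch->prefetchTarget = prefetch->prefetchReset;
        prefetch->prefetchIndex = startIndex;
    }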
5) index-only scans
I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access the heap anyway. And if that happens, we end up in the bizarre
situation where IOS is slower than a regular index scan. But to address
this, we'd
have to consider the visibility during prefetching.
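If we went that way, the per-block prefetch step might look something
like this hypothetical sketch (not part of the patch - the function name
and the vmbuffer argument are made up for illustration), using the
visibility map the same way IOS itself does:

    /* needs access/visibilitymap.h and storage/bufmgr.h */

    /*
     * Hypothetical: prefetch only heap pages that are not all-visible.
     * The caller would keep vmbuffer pinned across calls, like ioss_VMBuffer.
     */
    static void
    index_prefetch_block_ios(IndexScanDesc scan, BlockNumber block,
                             Buffer *vmbuffer)
    {
        /* IOS won't touch the heap for all-visible pages, so don't prefetch */
        if (VM_ALL_VISIBLE(scan->heapRelation, block, vmbuffer))
            return;

        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
    }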
Benchmarks
----------
1) OLTP
For OLTP, I tested different queries with various index types, on data
sets constructed to have a certain number of matching rows, forcing
different types of query plans (bitmap, index, seqscan).
The data sets are ~34GB, which is much more than the available RAM (8GB).
For example for BTREE, we have a query like this:
SELECT * FROM btree_test WHERE a = $v
with data matching 1, 10, 100, ..., 100000 rows for each $v. The results
look like this:
rows      bitmapscan    master    patched    seqscan     [ms]
1               19.8      20.4       18.8    31875.5
10              24.4      23.8       23.2    30642.4
100             27.7      40.0       26.3    31871.3
1000            45.8     178.0       45.4    30754.1
10000          171.8    1514.9      174.5    30743.3
100000        1799.0   15993.3     1777.4    30937.3
This says that the query takes ~31s with a seqscan, 1.8s with a bitmap
scan and 16s with an index scan (on master). With the prefetching patch,
it takes about 1.8s, i.e. about the same as the bitmap scan.
I don't know exactly where the plan would switch from an index scan to a
bitmap scan, but the table has ~100M rows, so all of these row counts
are tiny. I'd bet most of these cases would use a plain index scan.
For a query with ordering:
SELECT * FROM btree_test WHERE a >= $v ORDER BY a LIMIT $n
the results look a bit different:
rows      bitmapscan    master    patched    seqscan     [ms]
1            52703.9      19.5       19.5    31145.6
10           51208.1      22.7       24.7    30983.5
100          49038.6      39.0       26.3    32085.3
1000         53760.4     193.9       48.4    31479.4
10000        56898.4    1600.7      187.5    32064.5
100000       50975.2   15978.7     1848.9    31587.1
This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).
Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).
The results for other index types (HASH, GiST, SP-GiST) follow roughly
the same pattern. See the attached PDF for more charts, and [1] for the
complete results.
Benchmark / TPC-H
-----------------
I ran the 22 queries on a 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):
query    serial    parallel
1          101%         99%
2          119%        100%
3          100%         99%
4          101%        100%
5          101%        100%
6           12%         99%
7          100%        100%
8           52%         67%
10         102%        101%
11         100%         72%
12         101%        100%
13         100%        101%
14          13%        100%
15         101%        100%
16          99%         99%
17          95%        101%
18         101%        106%
19          30%         40%
20          99%        100%
21         101%        100%
22         101%        107%
The percentage is (timing patched / timing master), so <100% means
faster and >100% means slower.
How much each query is affected depends on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.
My explanation is that either (a) the parallel case used a different
plan with fewer index scans, or (b) the parallel query does more
concurrent I/O simply by using parallel workers. Or maybe both.
There are a couple of regressions too. I believe those are due to doing
too much prefetching in some cases; the heuristics mentioned earlier
should eliminate most of that.
regards
[1]: https://github.com/tvondra/index-prefetch-tests
[2]: https://github.com/tvondra/postgres/tree/dev/index-prefetch
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
index-prefetch-poc.patch (text/x-patch)
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index efdf9415d15..9b3625d833b 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -193,7 +193,7 @@ extern bool blinsert(Relation index, Datum *values, bool *isnull,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys, int prefetch, int prefetch_reset);
extern int64 blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void blrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 6cc7d07164a..0c6da1b635b 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -25,7 +25,7 @@
* Begin scan of bloom index.
*/
IndexScanDesc
-blbeginscan(Relation r, int nkeys, int norderbys)
+blbeginscan(Relation r, int nkeys, int norderbys, int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
BloomScanOpaque so;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..5b298c02cce 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -324,7 +324,7 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
* holding lock on index, it's not necessary to recompute it during brinrescan.
*/
IndexScanDesc
-brinbeginscan(Relation r, int nkeys, int norderbys)
+brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
BrinOpaque *opaque;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index ae7b0e9bb87..3087a986bc3 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -22,7 +22,7 @@
IndexScanDesc
-ginbeginscan(Relation rel, int nkeys, int norderbys)
+ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
GinScanOpaque so;
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069c..7b79128f2ce 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -493,12 +493,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
if (GistPageIsLeaf(page))
{
+ BlockNumber block = ItemPointerGetBlockNumber(&it->t_tid);
+
/* Creating heap-tuple GISTSearchItem */
item->blkno = InvalidBlockNumber;
item->data.heap.heapPtr = it->t_tid;
item->data.heap.recheck = recheck;
item->data.heap.recheckDistances = recheck_distances;
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+
/*
* In an index-only scan, also fetch the data from the tuple.
*/
@@ -529,6 +533,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
UnlockReleaseBuffer(buffer);
+
+ so->didReset = true;
}
/*
@@ -679,6 +685,8 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
so->curPageData++;
+ index_prefetch(scan, ForwardScanDirection);
+
return true;
}
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 00400583c0b..fdf978eaaad 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -22,6 +22,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+static void gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
/*
* Pairing heap comparison function for the GISTSearchItem queue
@@ -71,7 +73,7 @@ pairingheap_GISTSearchItem_cmp(const pairingheap_node *a, const pairingheap_node
*/
IndexScanDesc
-gistbeginscan(Relation r, int nkeys, int norderbys)
+gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
GISTSTATE *giststate;
@@ -111,6 +113,31 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ /*
+ * XXX maybe should happen in RelationGetIndexScan? But we need to define
+ * the callbacks, so that needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = gist_prefetch_getblock;
+ prefetcher->get_range = gist_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
/*
@@ -356,3 +383,42 @@ gistendscan(IndexScanDesc scan)
*/
freeGISTstate(so->giststate);
}
+
+static void
+gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->didReset;
+ so->didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->curPageData;
+ *end = (so->nPageData - 1);
+ }
+ else
+ {
+ *start = 0;
+ *end = so->curPageData;
+ }
+}
+
+static BlockNumber
+gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->curPageData) || (index >= so->nPageData))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->pageData[index].heapPtr;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fc5d97f606e..01a25132bce 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -48,6 +48,9 @@ static void hashbuildCallback(Relation index,
bool tupleIsAlive,
void *state);
+static void _hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
/*
* Hash handler function: return IndexAmRoutine with access method parameters
@@ -362,7 +365,7 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
* hashbeginscan() -- start a scan on a hash index
*/
IndexScanDesc
-hashbeginscan(Relation rel, int nkeys, int norderbys)
+hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
HashScanOpaque so;
@@ -383,6 +386,31 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL;
so->numKilled = 0;
+ /*
+ * XXX maybe should happen in RelationGetIndexScan? But we need to define
+ * the callbacks, so that needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = _hash_prefetch_getblock;
+ prefetcher->get_range = _hash_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
return scan;
@@ -918,3 +946,42 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
else
LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
}
+
+static void
+_hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->currPos.didReset;
+ so->currPos.didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->currPos.itemIndex;
+ *end = so->currPos.lastItem;
+ }
+ else
+ {
+ *start = so->currPos.firstItem;
+ *end = so->currPos.itemIndex;
+ }
+}
+
+static BlockNumber
+_hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->currPos.items[index].heapTid;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9ea2a42a07f..b5cea5e23eb 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -434,6 +434,8 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
currItem = &so->currPos.items[so->currPos.itemIndex];
scan->xs_heaptid = currItem->heapTid;
+ index_prefetch(scan, dir);
+
/* if we're here, _hash_readpage found a valid tuples */
return true;
}
@@ -467,6 +469,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
so->currPos.buf = buf;
so->currPos.currPage = BufferGetBlockNumber(buf);
+ so->currPos.didReset = true;
if (ScanDirectionIsForward(dir))
{
@@ -597,6 +600,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
Assert(so->currPos.firstItem <= so->currPos.lastItem);
+
return true;
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 646135cc21c..b2f4eadc1ea 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
@@ -756,6 +757,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
PROGRESS_CLUSTER_INDEX_RELID
};
int64 ci_val[2];
+ int prefetch_target;
+
+ prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
/* Set phase and OIDOldIndex to columns */
ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -764,7 +768,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+ prefetch_target, prefetch_target);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aebab..264ebe1d8e5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ /* set in each AM when applicable */
+ scan->xs_prefetch = NULL;
+
return scan;
}
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, irel,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, indexRelation,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7abc..aa8a14624d8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -59,6 +59,7 @@
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -106,7 +107,8 @@ do { \
static IndexScanDesc index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap);
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset);
/* ----------------------------------------------------------------
@@ -200,18 +202,36 @@ index_insert(Relation indexRelation,
* index_beginscan - start a scan of an index with amgettuple
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * really needed. Secondly, we must prevent infinite loop when determining
+ * prefetch value for the tablespace - the get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in infinite loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply after effective_io_concurrency
+ * lookup moved to caller of index_beginscan.
*/
IndexScanDesc
index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys)
+ int nkeys, int norderbys,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+ prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -241,7 +261,8 @@ index_beginscan_bitmap(Relation indexRelation,
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+ 0, 0); /* no prefetch */
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -258,7 +279,8 @@ index_beginscan_bitmap(Relation indexRelation,
static IndexScanDesc
index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap)
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
@@ -276,8 +298,8 @@ index_beginscan_internal(Relation indexRelation,
/*
* Tell the AM to open a scan.
*/
- scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
- norderbys);
+ scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys,
+ prefetch_target, prefetch_reset);
/* Initialize information for parallel scan. */
scan->parallel_scan = pscan;
scan->xs_temp_snap = temp_snap;
@@ -317,6 +339,16 @@ index_rescan(IndexScanDesc scan,
scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
orderbys, norderbys);
+
+ /* If we're prefetching for this index, maybe reset some of the state. */
+ if (scan->xs_prefetch != NULL)
+ {
+ IndexPrefetch prefetcher = scan->xs_prefetch;
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+ prefetcher->prefetchReset);
+ }
}
/* ----------------
@@ -487,10 +519,13 @@ index_parallelrescan(IndexScanDesc scan)
* index_beginscan_parallel - join parallel index scan
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
*/
IndexScanDesc
index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
- int norderbys, ParallelIndexScanDesc pscan)
+ int norderbys, ParallelIndexScanDesc pscan,
+ int prefetch_target, int prefetch_reset)
{
Snapshot snapshot;
IndexScanDesc scan;
@@ -499,7 +534,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
RegisterSnapshot(snapshot);
scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
- pscan, true);
+ pscan, true, prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -557,6 +592,9 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
pgstat_count_index_tuples(scan->indexRelation, 1);
+ /* do index prefetching, if needed */
+ index_prefetch(scan, direction);
+
/* Return the TID of the tuple we found. */
return &scan->xs_heaptid;
}
@@ -988,3 +1026,228 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+
+
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not to use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+void
+index_prefetch(IndexScanDesc scan, ScanDirection dir)
+{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ /*
+ * No heap relation means bitmap index scan, which does prefetching at
+ * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+ * without the heap)
+ *
+ * XXX But in this case we should have prefetchMaxTarget=0, because in
+ * index_beginscan_bitmap() we disable prefetching. So maybe we should
+ * just check that.
+ */
+ if (!prefetch)
+ return;
+
+ /* was it initialized correctly? */
+ // Assert(prefetch->prefetchIndex != -1);
+
+ /*
+ * If we got here, prefetching is enabled and it's a node that supports
+ * prefetching (i.e. it can't be a bitmap index scan).
+ */
+ Assert(scan->heapRelation);
+
+ /* gradually increase the prefetch distance */
+ prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+ prefetch->prefetchMaxTarget);
+
+ /*
+ * Did we already reach the point to actually start prefetching? If not,
+ * we're done. We'll try again for the next index tuple.
+ */
+ if (prefetch->prefetchTarget <= 0)
+ return;
+
+ /*
+ * XXX I think we don't need to worry about direction here, that's handled
+ * by how the AMs build the curPos etc. (see nbtsearch.c)
+ */
+ if (ScanDirectionIsForward(dir))
+ {
+ bool reset;
+ int startIndex,
+ endIndex;
+
+ /* get indexes of unprocessed index entries */
+ prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+ /*
+ * Did we switch to a different index block? if yes, reset relevant
+ * info so that we start prefetching from scratch.
+ */
+ if (reset)
+ {
+ prefetch->prefetchTarget = prefetch->prefetchReset;
+ prefetch->prefetchIndex = startIndex; /* maybe -1 instead? */
+ pgBufferUsage.blks_prefetch_rounds++;
+ }
+
+ /*
+ * Adjust the range, based on what we already prefetched, and also
+ * based on the prefetch target.
+ *
+ * XXX We need to adjust the end index first, because it depends on
+ * the actual position, before we consider how far we prefetched.
+ */
+ endIndex = Min(endIndex, startIndex + prefetch->prefetchTarget);
+ startIndex = Max(startIndex, prefetch->prefetchIndex + 1);
+
+ for (int i = startIndex; i <= endIndex; i++)
+ {
+ bool recently_prefetched = false;
+ BlockNumber block;
+
+ block = prefetch->get_block(scan, dir, i);
+
+ /*
+ * Do not prefetch the same block over and over again,
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ *
+ * XXX We can't just check blocks between startIndex and endIndex,
+ * because at some point (after the prefetch target gets ramped up)
+ * it's going to be just a single block.
+ *
+ * XXX The solution here is pretty trivial - we just check the
+ * immediately preceding block. We could check a longer history, or
+ * maybe maintain some "already prefetched" struct (small LRU array
+ * of last prefetched blocks - say 8 blocks or so - would work fine,
+ * I think).
+ */
+ for (int j = 0; j < 8; j++)
+ {
+ /* the cached block might be InvalidBlockNumber, but that's fine */
+ if (prefetch->cacheBlocks[j] == block)
+ {
+ recently_prefetched = true;
+ break;
+ }
+ }
+
+ if (recently_prefetched)
+ continue;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+
+ prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+ prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+ }
+
+ prefetch->prefetchIndex = endIndex;
+ }
+ else
+ {
+ bool reset;
+ int startIndex,
+ endIndex;
+
+ /* get indexes of unprocessed index entries */
+ prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+ /* FIXME handle the reset flag */
+
+ /*
+ * Adjust the range, based on what we already prefetched, and also
+ * based on the prefetch target.
+ *
+ * XXX We need to adjust the start index first, because it depends on
+ * the actual position, before we consider how far we prefetched (which
+ * for backwards scans is (end index).
+ */
+ startIndex = Max(startIndex, endIndex - prefetch->prefetchTarget);
+ endIndex = Min(endIndex, prefetch->prefetchIndex - 1);
+
+ for (int i = endIndex; i >= startIndex; i--)
+ {
+ bool recently_prefetched = false;
+ BlockNumber block;
+
+ block = prefetch->get_block(scan, dir, i);
+
+ /*
+ * Do not prefetch the same block over and over again,
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ *
+ * XXX We can't just check blocks between startIndex and endIndex,
+ * because at some point (after the prefetch target gets ramped up)
+ * it's going to be just a single block.
+ *
+ * XXX The solution here is pretty trivial - we just check the
+ * immediately preceding block. We could check a longer history, or
+ * maybe maintain some "already prefetched" struct (small LRU array
+ * of last prefetched blocks - say 8 blocks or so - would work fine,
+ * I think).
+ */
+ for (int j = 0; j < 8; j++)
+ {
+ /* the cached block might be InvalidBlockNumber, but that's fine */
+ if (prefetch->cacheBlocks[j] == block)
+ {
+ recently_prefetched = true;
+ break;
+ }
+ }
+
+ if (recently_prefetched)
+ continue;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+
+ prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+ prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+ }
+
+ prefetch->prefetchIndex = startIndex;
+ }
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ce5b15199a..b1a02cc9bcd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -37,6 +37,7 @@
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
+#include "utils/spccache.h"
/*
@@ -87,6 +88,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+static void _bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -341,7 +344,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
* btbeginscan() -- start a scan on a btree index
*/
IndexScanDesc
-btbeginscan(Relation rel, int nkeys, int norderbys)
+btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
BTScanOpaque so;
@@ -369,6 +372,31 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ /*
+ * XXX maybe should happen in RelationGetIndexScan? But we need to define
+ * the callbacks, so that needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = _bt_prefetch_getblock;
+ prefetcher->get_range = _bt_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -1423,3 +1451,42 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+static void
+_bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->currPos.didReset;
+ so->currPos.didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->currPos.itemIndex;
+ *end = so->currPos.lastItem;
+ }
+ else
+ {
+ *start = so->currPos.firstItem;
+ *end = so->currPos.itemIndex;
+ }
+}
+
+static BlockNumber
+_bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->currPos.items[index].heapTid;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce95..762d95d09ed 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -47,7 +47,6 @@ static Buffer _bt_walk_left(Relation rel, Relation heaprel, Buffer buf,
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
/*
* _bt_drop_lock_and_maybe_pin()
*
@@ -1385,7 +1384,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
_bt_parallel_done(scan);
BTScanPosInvalidate(so->currPos);
-
return false;
}
else
@@ -1538,6 +1536,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
Assert(BufferIsValid(so->currPos.buf));
+ /*
+ * Mark the currPos as reset before loading the next chunk of pointers, to
+ * restart the prefetching.
+ */
+ so->currPos.didReset = true;
+
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index cbfaf0c00ac..79015194b73 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/genam.h"
+#include "access/relation.h"
#include "access/relscan.h"
#include "access/spgist_private.h"
#include "miscadmin.h"
@@ -32,6 +33,10 @@ typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
SpGistLeafTuple leafTuple, bool recheck,
bool recheckDistances, double *distances);
+static void spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
+
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
* KNN-searches currently only support NULLS LAST. So, preserve this logic
@@ -191,6 +196,7 @@ resetSpGistScanOpaque(SpGistScanOpaque so)
pfree(so->reconTups[i]);
}
so->iPtr = so->nPtrs = 0;
+ so->didReset = true;
}
/*
@@ -301,7 +307,7 @@ spgPrepareScanKeys(IndexScanDesc scan)
}
IndexScanDesc
-spgbeginscan(Relation rel, int keysz, int orderbysz)
+spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
SpGistScanOpaque so;
@@ -316,6 +322,8 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
so->keyData = NULL;
initSpGistState(&so->state, scan->indexRelation);
+ so->state.heap = relation_open(scan->indexRelation->rd_index->indrelid, NoLock);
+
so->tempCxt = AllocSetContextCreate(CurrentMemoryContext,
"SP-GiST search temporary context",
ALLOCSET_DEFAULT_SIZES);
@@ -371,6 +379,31 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
so->indexCollation = rel->rd_indcollation[0];
+ /*
+ * XXX maybe should happen in RelationGetIndexScan? But we need to define
+ * the callbacks, so that needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = spgist_prefetch_getblock;
+ prefetcher->get_range = spgist_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
return scan;
@@ -453,6 +486,8 @@ spgendscan(IndexScanDesc scan)
pfree(scan->xs_orderbynulls);
}
+ relation_close(so->state.heap, NoLock);
+
pfree(so);
}
@@ -584,6 +619,13 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
isnull,
distances);
+ // FIXME prefetch here? or in storeGettuple?
+ {
+ BlockNumber block = ItemPointerGetBlockNumber(&leafTuple->heapPtr);
+
+ PrefetchBuffer(so->state.heap, MAIN_FORKNUM, block);
+ }
+
spgAddSearchItemToQueue(so, heapItem);
MemoryContextSwitchTo(oldCxt);
@@ -1047,7 +1089,12 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
index_store_float8_orderby_distances(scan, so->orderByTypes,
so->distances[so->iPtr],
so->recheckDistances[so->iPtr]);
+
so->iPtr++;
+
+ /* prefetch additional tuples */
+ index_prefetch(scan, dir);
+
return true;
}
@@ -1070,6 +1117,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
pfree(so->reconTups[i]);
}
so->iPtr = so->nPtrs = 0;
+ so->didReset = true;
spgWalk(scan->indexRelation, so, false, storeGettuple,
scan->xs_snapshot);
@@ -1095,3 +1143,42 @@ spgcanreturn(Relation index, int attno)
return cache->config.canReturnData;
}
+
+static void
+spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->didReset;
+ so->didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->iPtr;
+ *end = (so->nPtrs - 1);
+ }
+ else
+ {
+ *start = 0;
+ *end = so->iPtr;
+ }
+}
+
+static BlockNumber
+spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->iPtr) || (index >= so->nPtrs))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->heapPtrs[index];
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 190e4f76a9e..4aac68f0766 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -17,6 +17,7 @@
#include "access/amvalidate.h"
#include "access/htup_details.h"
+#include "access/relation.h"
#include "access/reloptions.h"
#include "access/spgist_private.h"
#include "access/toast_compression.h"
@@ -334,6 +335,9 @@ initSpGistState(SpGistState *state, Relation index)
state->index = index;
+ /* we'll initialize the reference in spgbeginscan */
+ state->heap = NULL;
+
/* Get cached static information about index */
cache = spgGetCache(index);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 15f9bddcdf3..0e41ffa8fc0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
!INSTR_TIME_IS_ZERO(usage->blk_write_time));
bool has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+ bool has_prefetches = (usage->blks_prefetches > 0);
bool show_planning = (planning && (has_shared ||
has_local || has_temp || has_timing ||
has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
appendStringInfoChar(es->str, '\n');
}
+ /* As above, show only positive counter values. */
+ if (has_prefetches)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Prefetches:");
+
+ if (usage->blks_prefetches > 0)
+ appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+ if (usage->blks_prefetch_rounds > 0)
+ appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+
if (show_planning)
es->indent--;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b897..e5ce1dbc953 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* May have to restart scan from this point if a potential conflict is
* found.
+ *
+ * XXX Should this do index prefetch? Probably not worth it for unique
+ * constraints, I guess? Otherwise we should calculate prefetch_target
+ * just like in nodeIndexscan etc.
*/
retry:
conflict = false;
found_self = false;
- index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+ index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9dd71684615..a997aac828f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -157,8 +157,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+ /* Start an index scan.
+ *
+ * XXX Should this do index prefetching? We're looking for a single tuple,
+ * probably using a PK / UNIQUE index, so does not seem worth it. If we
+ * reconsider this, calculate prefetch_target like in nodeIndexscan.
+ */
+ scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
retry:
found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d2..434be59fca0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
dst->local_blks_written += add->local_blks_written;
dst->temp_blks_read += add->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+ dst->blks_prefetches += add->blks_prefetches;
INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b9699..3ecb8470d47 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We reach here if the index only scan is not parallel, or if we're
* serially executing an index only scan that was planned to be
* parallel.
+ *
+ * XXX Maybe we should enable prefetching, but prefetch only pages that
+ * are not all-visible (but checking that from the index code seems like
+ * a violation of layering etc).
+ *
+ * XXX This might lead to IOS being slower than plain index scan, if the
+ * table has a lot of pages that need recheck.
*/
scandesc = index_beginscan(node->ss.ss_currentRelation,
node->ioss_RelationDesc,
estate->es_snapshot,
node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ node->ioss_NumOrderByKeys,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc = scandesc;
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -719,7 +728,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d2..71ae6a47ce5 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
/*
* When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ Relation heapRel = node->ss.ss_currentRelation;
/*
* extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
if (scandesc == NULL)
{
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
if (scandesc == NULL)
{
+ Relation heapRel = node->ss.ss_currentRelation;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -1678,6 +1717,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Maybe reduce the value with parallel workers?
+ */
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1690,7 +1744,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1726,6 +1782,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->iss_ScanDesc =
@@ -1733,7 +1797,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..0b02b6265d0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
index_scan = index_beginscan(heapRel, indexRel,
&SnapshotNonVacuumable,
- 1, 0);
+ 1, 0, 0, 0); /* XXX maybe do prefetch? */
/* Set it up for index-only scan */
index_scan->xs_want_itup = true;
index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 4476ff7fba1..80fec7a11f9 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -160,7 +160,9 @@ typedef void (*amadjustmembers_function) (Oid opfamilyoid,
/* prepare for index scan */
typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
int nkeys,
- int norderbys);
+ int norderbys,
+ int prefetch_maximum,
+ int prefetch_reset);
/* (re)start index scan */
typedef void (*amrescan_function) (IndexScanDesc scan,
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 97ddc925b27..f17dcdffd86 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -96,7 +96,7 @@ extern bool brininsert(Relation idxRel, Datum *values, bool *nulls,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern int64 bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void brinrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a3087956654..6a500c5aa1f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -152,7 +152,9 @@ extern bool index_insert(Relation indexRelation,
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys);
+ int nkeys, int norderbys,
+ int prefetch_target,
+ int prefetch_reset);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
@@ -169,7 +171,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
extern void index_parallelrescan(IndexScanDesc scan);
extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
Relation indexrel, int nkeys, int norderbys,
- ParallelIndexScanDesc pscan);
+ ParallelIndexScanDesc pscan,
+ int prefetch_target,
+ int prefetch_reset);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
struct TupleTableSlot;
@@ -230,4 +234,45 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
ScanDirection direction);
extern void systable_endscan_ordered(SysScanDesc sysscan);
+
+
+void index_prefetch(IndexScanDesc scandesc, ScanDirection direction);
+
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int *start, int *end,
+ bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int index);
+
+typedef struct IndexPrefetchData
+{
+ /*
+ * XXX We need to disable this in some cases (e.g. when using index-only
+ * scans, we don't want to prefetch pages). Or maybe we should prefetch
+ * only pages that are not all-visible, that'd be even better.
+ */
+ int prefetchIndex; /* how far we already prefetched */
+ int prefetchTarget; /* how far we should be prefetching */
+ int prefetchMaxTarget; /* maximum prefetching distance */
+ int prefetchReset; /* reset to this distance on rescan */
+
+ /*
+ * a small LRU cache of recently prefetched blocks
+ *
+ * XXX needs to be tiny, to make the (frequent) searches very cheap
+ */
+ BlockNumber cacheBlocks[8];
+ int cacheIndex;
+
+ prefetcher_getblock_function get_block;
+ prefetcher_getrange_function get_range;
+
+} IndexPrefetchData;
+
#endif /* GENAM_H */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 6da64928b66..b4bd3b2e202 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -384,7 +384,7 @@ typedef struct GinScanOpaqueData
typedef GinScanOpaqueData *GinScanOpaque;
-extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void ginendscan(IndexScanDesc scan);
extern void ginrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 3edc740a3f3..e844a9eed84 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -176,6 +176,7 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+ bool didReset; /* reset since last access? */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
diff --git a/src/include/access/gistscan.h b/src/include/access/gistscan.h
index 65911245f74..adf167a60b6 100644
--- a/src/include/access/gistscan.h
+++ b/src/include/access/gistscan.h
@@ -16,7 +16,7 @@
#include "access/amapi.h"
-extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
ScanKey orderbys, int norderbys);
extern void gistendscan(IndexScanDesc scan);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9e035270a16..743192997c5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -124,6 +124,8 @@ typedef struct HashScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
+ bool didReset;
+
HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
} HashScanPosData;
@@ -370,7 +372,7 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
struct IndexInfo *indexInfo);
extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
-extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
extern void hashendscan(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d6847860959..8d053de461b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -984,6 +984,9 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
+ /* Was the position reset/rebuilt since the last time we checked it? */
+ bool didReset;
+
BTScanPosItem items[MaxTIDsPerBTreePage]; /* MUST BE LAST */
} BTScanPosData;
@@ -1019,6 +1022,7 @@ typedef BTScanPosData *BTScanPos;
(scanpos).buf = InvalidBuffer; \
(scanpos).lsn = InvalidXLogRecPtr; \
(scanpos).nextTupleOffset = 0; \
+ (scanpos).didReset = true; \
} while (0)
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
@@ -1127,7 +1131,7 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac04..c119fe597d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
Relation rel;
} IndexFetchTableData;
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
bool *xs_orderbynulls;
bool xs_recheckorderby;
+ /* prefetching state (or NULL if disabled) */
+ IndexPrefetchData *xs_prefetch;
+
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
} IndexScanDescData;
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index fe31d32dbe9..e1e2635597c 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -203,7 +203,7 @@ extern bool spginsert(Relation index, Datum *values, bool *isnull,
struct IndexInfo *indexInfo);
/* spgscan.c */
-extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz);
+extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset);
extern void spgendscan(IndexScanDesc scan);
extern void spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index c6ef46fc206..e00d4fc90b6 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -144,7 +144,7 @@ typedef struct SpGistTypeDesc
typedef struct SpGistState
{
Relation index; /* index we're working with */
-
+ Relation heap; /* heap the index is defined on */
spgConfigOut config; /* filled in by opclass config method */
SpGistTypeDesc attType; /* type of values to be indexed/restored */
@@ -231,6 +231,7 @@ typedef struct SpGistScanOpaqueData
bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck
* flags */
HeapTuple reconTups[MaxIndexTuplesPerPage]; /* reconstructed tuples */
+ bool didReset; /* reset since last access? */
/* distances (for recheck) */
IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183bd..97dd3c2c421 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+ int64 blks_prefetch_rounds; /* # of prefetch rounds */
+ int64 blks_prefetches; /* # of buffers prefetched */
instr_time blk_read_time; /* time spent reading blocks */
instr_time blk_write_time; /* time spent writing blocks */
instr_time temp_blk_read_time; /* time spent reading temp blocks */
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.
Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).
ScalarArrayOpExpr for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck
there. At the same time, adding useful prefetching for
ScalarArrayOpExpr execution more or less requires that you first
improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.
Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).
This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on. Not so for
the superficially similar (though actually rather different) plain
index scan case -- at least not once you add prefetching. If you're
uselessly processing the same leaf page multiple times, then there is
no way that heap prefetching can notice that it should be batching
things up. The context that would allow prefetching to work well isn't
really available right now. So the plain index scan case is kinda at a
gratuitous disadvantage (with prefetching) relative to the bitmap
index scan case.
Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.
But there's three shortcomings in logic:
1) It's not clear the thresholds for prefetching being beneficial and
switching to bitmap index scans are the same value. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.
As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar. I suspect that it
will be useful to ask the same question for various specific cases,
that you might not have thought about just yet. Things like
ScalarArrayOpExpr queries, where bitmap index scans might look like
they have a natural advantage due to an inherent need for random heap
access in the plain index scan case.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
That's what I was mostly trying to get at when I talked about testing
strategy at the unconference session (this may have been unclear at
the time). It could be done in a way that helps you to think about the
problem from first principles. It could be really useful as a way of
avoiding confusing cases where plain index scan + prefetch does badly
due to implementation restrictions, versus cases where it's
*inherently* the wrong strategy. And a testing strategy that starts
with very basic ideas about what I/O is truly necessary might help you
to notice and fix regressions. The difference will never be perfectly
crisp, of course (isn't bitmap index scan basically just index scan
with a really huge prefetch buffer anyway?), but it still seems like a
useful direction to go in.
Implementation
--------------
When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.
So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).
Maybe you were right to do that, but I'm not entirely sure.
Bear in mind that the ScalarArrayOpExpr case already looks like a
single index scan whose qual involves an array to the executor, even
though nbtree more or less implements it as multiple index scans with
plain constant quals (one per unique-ified array element). Index scans
whose results can be "OR'd together". Is that a modularity violation?
And if so, why? As I've pointed out earlier in this email, we don't do
very much with that context right now -- but clearly we should.
In other words, maybe you're right to suspect that doing this in AMs
like nbtree is a modularity violation. OTOH, maybe it'll turn out that
that's exactly the right place to do it, because that's the only way
to make the full context available in one place. I myself struggled
with this when I reviewed the skip scan patch. I was sure that Tom
wouldn't like the way that the skip-scan patch doubles-down on adding
more intelligence/planning around how to execute queries with
skippable leading columns. But, it turned out that he saw the merit in
it, and basically accepted that general approach. Maybe this will turn
out to be a little like that situation, where (counter to intuition)
what you really need to do is add a new "layering violation".
Sometimes that's the only thing that'll allow the information to flow
to the right place. It's tricky.
4) per-leaf prefetching
The code is restricted to only prefetch items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
I tend to agree that this sort of thing doesn't need to happen in the
first committed version. But FWIW nbtree could be taught to scan
multiple index pages and act as if it had just processed them as one
single index page -- up to a point. This is at least possible with
plain index scans that use MVCC snapshots (though not index-only
scans), since we already drop the pin on the leaf page there anyway.
AFAICT nothing stops us from teaching nbtree to "lie" to the executor and tell
it that we processed 1 leaf page, even though it was actually 5 leaf pages
(maybe there would also have to be restrictions for the markpos stuff).
the results look a bit different:
rows bitmapscan master patched seqscan
1 52703.9 19.5 19.5 31145.6
10 51208.1 22.7 24.7 30983.5
100 49038.6 39.0 26.3 32085.3
1000 53760.4 193.9 48.4 31479.4
10000 56898.4 1600.7 187.5 32064.5
100000 50975.2 15978.7 1848.9 31587.1
This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).
Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).
Nice! And, it'll be nice to be able to use the kill_prior_tuple
optimization in many more cases (possible by teaching the optimizer to
favor index scans over bitmap index scans more often).
--
Peter Geoghegan
On 6/8/23 20:56, Peter Geoghegan wrote:
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.
Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).
I don't think SAOP is the reason. I did a bit of digging in the list
archives, and found thread [1], which says:
Regardless of what mechanism is used and who is responsible for
doing it someone is going to have to figure out which blocks are
specifically interesting to prefetch. Bitmap index scans happen
to be the easiest since we've already built up a list of blocks
we plan to read. Somehow that information has to be pushed to the
storage manager to be acted upon.
Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.
So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.
There's a couple more ~2008 threads mentioning prefetching, bitmap scans
and even regular index scans (like [2]). None of them even mentions SAOP
stuff at all.
[1]: /messages/by-id/871wa17vxb.fsf@oxford.xeocode.com
[2]: /messages/by-id/87wsnnz046.fsf@oxford.xeocode.com
ScalarArrayOpExpr for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck
there. At the same time, adding useful prefetching for
ScalarArrayOpExpr execution more or less requires that you first
improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.
Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).
This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on. Not so for
the superficially similar (though actually rather different) plain
index scan case -- at least not once you add prefetching. If you're
uselessly processing the same leaf page multiple times, then there is
no way that heap prefetching can notice that it should be batching
things up. The context that would allow prefetching to work well isn't
really available right now. So the plain index scan case is kinda at a
gratuitous disadvantage (with prefetching) relative to the bitmap
index scan case.
Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.
Even if SAOP (probably) wasn't the reason, I think you're right it may
be an issue for prefetching, causing regressions. It didn't occur to me
before, because I'm not that familiar with the btree code and/or how it
deals with SAOP (and didn't really intend to study it too deeply).
So if you're planning to work on this for PG17, collaborating on it
would be great.
For now I plan to just ignore SAOP, or maybe just disable prefetching
for SAOP index scans if it proves to be prone to regressions. That's not
great, but at least it won't make matters worse.
But there's three shortcomings in logic:
1) It's not clear the thresholds for prefetching being beneficial and
switching to bitmap index scans are the same value. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.
As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar. I suspect that it
will be useful to ask the same question for various specific cases,
that you might not have thought about just yet. Things like
ScalarArrayOpExpr queries, where bitmap index scans might look like
they have a natural advantage due to an inherent need for random heap
access in the plain index scan case.
Yeah, although all the tests were done with a random table generated
like this:
insert into btree_test select $d * random(), md5(i::text)
from generate_series(1, $ROWS) s(i)
So it's damn random anyway. Although maybe it's random even for the
bitmap case, so maybe if the SAOP had some sort of locality, that'd be
an advantage for the bitmap scan. But what would such a table look like?
I guess something like this might be a "nice" bad case:
insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i)
select * from btree_test where a in (999, 1000, 1001, 1002)
The values are likely colocated on the same heap page, the bitmap scan
is going to do a single prefetch. With index scan we'll prefetch them
repeatedly. I'll give it a try.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
I do agree, but what do you mean by "assessing"? Wasn't the agreement at
the unconference session that we'd not tweak costing? So ultimately, this
does not really affect which scan type we pick. We'll keep doing the
same planning decisions as today, no?
If we pick index scan and enable prefetching, causing a regression (e.g.
for the SAOP with locality), that'd be bad. But how is that related to
viability of index scans over bitmap index scans?
That's what I was mostly trying to get at when I talked about testing
strategy at the unconference session (this may have been unclear at
the time). It could be done in a way that helps you to think about the
problem from first principles. It could be really useful as a way of
avoiding confusing cases where plain index scan + prefetch does badly
due to implementation restrictions, versus cases where it's
*inherently* the wrong strategy. And a testing strategy that starts
with very basic ideas about what I/O is truly necessary might help you
to notice and fix regressions. The difference will never be perfectly
crisp, of course (isn't bitmap index scan basically just index scan
with a really huge prefetch buffer anyway?), but it still seems like a
useful direction to go in.
I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.
Implementation
--------------
When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.
So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).
Maybe you were right to do that, but I'm not entirely sure.
Bear in mind that the ScalarArrayOpExpr case already looks like a
single index scan whose qual involves an array to the executor, even
though nbtree more or less implements it as multiple index scans with
plain constant quals (one per unique-ified array element). Index scans
whose results can be "OR'd together". Is that a modularity violation?
And if so, why? As I've pointed out earlier in this email, we don't do
very much with that context right now -- but clearly we should.
In other words, maybe you're right to suspect that doing this in AMs
like nbtree is a modularity violation. OTOH, maybe it'll turn out that
that's exactly the right place to do it, because that's the only way
to make the full context available in one place. I myself struggled
with this when I reviewed the skip scan patch. I was sure that Tom
wouldn't like the way that the skip-scan patch doubles-down on adding
more intelligence/planning around how to execute queries with
skippable leading columns. But, it turned out that he saw the merit in
it, and basically accepted that general approach. Maybe this will turn
out to be a little like that situation, where (counter to intuition)
what you really need to do is add a new "layering violation".
Sometimes that's the only thing that'll allow the information to flow
to the right place. It's tricky.
There are two reasons why I think AM is not the right place:
- accessing table from index code seems backwards
- we already do prefetching from the executor (nodeBitmapHeapscan.c)
It feels kinda wrong in hindsight.
4) per-leaf prefetching
The code is restricted to only prefetch items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
I tend to agree that this sort of thing doesn't need to happen in the
first committed version. But FWIW nbtree could be taught to scan
multiple index pages and act as if it had just processed them as one
single index page -- up to a point. This is at least possible with
plain index scans that use MVCC snapshots (though not index-only
scans), since we already drop the pin on the leaf page there anyway.
AFAICT nothing stops us from teaching nbtree to "lie" to the executor and tell
it that we processed 1 leaf page, even though it was actually 5 leaf pages
(maybe there would also have to be restrictions for the markpos stuff).
Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.
the results look a bit different:
rows bitmapscan master patched seqscan
1 52703.9 19.5 19.5 31145.6
10 51208.1 22.7 24.7 30983.5
100 49038.6 39.0 26.3 32085.3
1000 53760.4 193.9 48.4 31479.4
10000 56898.4 1600.7 187.5 32064.5
100000 50975.2 15978.7 1848.9 31587.1
This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).
Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).
Nice! And, it'll be nice to be able to use the kill_prior_tuple
optimization in many more cases (possible by teaching the optimizer to
favor index scans over bitmap index scans more often).
Right, I forgot to mention that benefit. Although, that'd only happen if
we actually choose index scans in more places, which I guess would
require tweaking the costing model ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.
So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.
Even if SAOP (probably) wasn't the reason, I think you're right it may
be an issue for prefetching, causing regressions. It didn't occur to me
before, because I'm not that familiar with the btree code and/or how it
deals with SAOP (and didn't really intend to study it too deeply).
I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
page" in many important cases. Not really -- not in the sense that
you'd hope and expect. We're senselessly processing the same index
leaf page multiple times and treating it as a different, independent
leaf page. That makes heap prefetching of the kind you're working on
utterly hopeless, since it effectively throws away lots of useful
context. Obviously that's the fault of nbtree ScalarArrayOpExpr
handling, not the fault of your patch.
So if you're planning to work on this for PG17, collaborating on it
would be great.
For now I plan to just ignore SAOP, or maybe just disable prefetching
for SAOP index scans if it proves to be prone to regressions. That's not
great, but at least it won't make matters worse.
Makes sense, but I hope that it won't come to that.
IMV it's actually quite reasonable that you didn't expect to have to
think about ScalarArrayOpExpr at all -- it would make a lot of sense
if that was already true. But the fact is that it works in a way
that's pretty silly and naive right now, which will impact
prefetching. I wasn't really thinking about regressions, though. I was
actually more concerned about missing opportunities to get the most
out of prefetching. ScalarArrayOpExpr really matters here.
I guess something like this might be a "nice" bad case:
insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i)
select * from btree_test where a in (999, 1000, 1001, 1002)
The values are likely colocated on the same heap page, the bitmap scan
is going to do a single prefetch. With index scan we'll prefetch them
repeatedly. I'll give it a try.
This is the sort of thing that I was thinking of. What are the
conditions under which bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of laws-of-physics level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?
I see that your patch has logic that avoids repeated prefetching of
the same block -- plus you have comments that wonder about going
further by adding a "small lru array" in your new index_prefetch()
function. I asked you about this during the unconference presentation.
But I think that my understanding of the situation was slightly
different to yours. That's relevant here.
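As a minimal sketch, assuming the cacheBlocks[]/cacheIndex fields the patch
adds to IndexPrefetchData, a hypothetical helper along these lines could
answer "was this block prefetched recently?" -- it is really a tiny FIFO
ring rather than a true LRU, which is probably fine at this size:

/*
 * Hypothetical helper (not part of the patch): return true if the block
 * was prefetched recently, otherwise remember it and return false.
 */
static bool
index_prefetch_is_recent(IndexPrefetchData *prefetch, BlockNumber block)
{
	/* linear search is cheap because the array is deliberately tiny */
	for (int i = 0; i < (int) lengthof(prefetch->cacheBlocks); i++)
	{
		if (prefetch->cacheBlocks[i] == block)
			return true;		/* prefetched recently, skip it */
	}

	/* not found - remember it, overwriting the oldest slot */
	prefetch->cacheBlocks[prefetch->cacheIndex] = block;
	prefetch->cacheIndex = (prefetch->cacheIndex + 1) % lengthof(prefetch->cacheBlocks);

	return false;
}

index_prefetch() would then issue PrefetchBuffer() only when this returns false.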
I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.
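To make that a bit more concrete, here is a rough sketch of the kind of thing
I mean for nbtree -- the function and comparator are hypothetical, and a real
version would at least need to handle backward scans and the markpos
machinery:

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

/* order TIDs by heap block number (hypothetical comparator) */
static int
heap_block_cmp(const void *a, const void *b)
{
	BlockNumber ba = ItemPointerGetBlockNumber((const ItemPointerData *) a);
	BlockNumber bb = ItemPointerGetBlockNumber((const ItemPointerData *) b);

	if (ba < bb)
		return -1;
	if (ba > bb)
		return 1;
	return 0;
}

/*
 * Hypothetical sketch: prefetch each distinct heap block referenced by the
 * current leaf page exactly once, in block order, while leaving items[] in
 * its original index-wise order for the executor.
 */
static void
prefetch_leaf_heap_blocks(Relation heapRel, BTScanPos pos)
{
	ItemPointerData tids[MaxTIDsPerBTreePage];
	int			ntids = pos->lastItem - pos->firstItem + 1;
	BlockNumber prev = InvalidBlockNumber;

	/* copy the TIDs so the original ordering of items[] stays intact */
	for (int i = 0; i < ntids; i++)
		tids[i] = pos->items[pos->firstItem + i].heapTid;

	qsort(tids, ntids, sizeof(ItemPointerData), heap_block_cmp);

	for (int i = 0; i < ntids; i++)
	{
		BlockNumber block = ItemPointerGetBlockNumber(&tids[i]);

		if (block != prev)		/* each distinct heap block only once */
			PrefetchBuffer(heapRel, MAIN_FORKNUM, block);
		prev = block;
	}
}

Restoring the original order is then free, since only the scratch copy was
sorted; actually batching the heap accesses themselves (the pin/unpin part)
would of course take more work than this.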
I'm talking about problems that exist today, without your patch.
I'll show a concrete example of the kind of index/index scan that
might be affected.
Attached is an extract of the server log when the regression tests ran
against a server patched to show custom instrumentation. The log
output shows exactly what's going on with one particular nbtree
opportunistic deletion (my point has nothing to do with deletion, but
it happens to be convenient to make my point in this fashion). This
specific example involves deletion of tuples from the system catalog
index "pg_type_typname_nsp_index". There is nothing very atypical
about it; it just shows a certain kind of heap fragmentation that's
probably very common.
Imagine a plain index scan involving a query along the lines of
"select * from pg_type where typname like 'part%' ", or similar. This
query runs an instant before the example LD_DEAD-bit-driven
opportunistic deletion (a "simple deletion" in nbtree parlance) took
place. You'll be able to piece together from the log output that there
would only be about 4 heap blocks involved with such a query. Ideally,
our hypothetical index scan would pin each buffer/heap page exactly
once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
we're talking about a fairly selective query here, that only needs to
scan precisely one leaf page (I verified this part too) -- so why
wouldn't we expect "index scan parity"?
While there is significant clustering on this example leaf page/key
space, heap TID is not *perfectly* correlated with the
logical/keyspace order of the index -- which can have outsized
consequences. Notice that some heap blocks are non-contiguous
relative to logical/keyspace/index scan/index page offset number order.
We'll end up pinning each of the 4 or so heap pages more than once
(sometimes several times each), when in principle we could have pinned
each heap page exactly once. In other words, there is way too much of
a difference between the case where the tuples we scan are *almost*
perfectly clustered (which is what you see in my example) and the case
where they're exactly perfectly clustered. In other other words, there
is way too much of a difference between plain index scan, and bitmap
index scan.
(What I'm saying here is only true because this is a composite index
and our query uses "like", returning rows that match a prefix -- if our
index was on the column "typname" alone and we used a simple equality
condition in our query then the Postgres 12 nbtree work would be
enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
that there are still relatively many important cases where we perform
extra PinBuffer()/UnpinBuffer() calls during plain index scans that
only touch one leaf page anyway.)
Obviously we should expect bitmap index scans to have a natural
advantage over plain index scans whenever there is little or no
correlation -- that's clear. But that's not what we see here -- we're
way too sensitive to minor imperfections in clustering that are
naturally present on some kinds of leaf pages. The potential
difference in pin/unpin traffic (relative to the bitmap index scan
case) seems pathological to me. Ideally, we wouldn't have these kinds
of differences at all. It's going to disrupt usage_count on the
buffers.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
I do agree, but what do you mean by "assessing"?
I mean performance validation. There ought to be a theoretical model
that describes the relationship between index scan and bitmap index
scan, that has actual predictive power in the real world, across a
variety of different cases. Something that isn't sensitive to the
current phase of the moon (e.g., heap fragmentation along the lines of
my pg_type_typname_nsp_index log output). I particularly want to avoid
nasty discontinuities that really make no sense.
Wasn't the agreement at
the unconference session that we'd not tweak costing? So ultimately, this
does not really affect which scan type we pick. We'll keep doing the
same planning decisions as today, no?
I'm not really talking about tweaking the costing. What I'm saying is
that we really should expect index scans to behave similarly to bitmap
index scans at runtime, for queries that really don't have much to
gain from using a bitmap heap scan (queries that may or may not also
benefit from prefetching). There are several reasons why this makes
sense to me.
One reason is that it makes tweaking the actual costing easier later
on. Also, your point about plan robustness was a good one. If we make
the wrong choice about index scan vs bitmap index scan, and the
consequences aren't so bad, that's a very useful enhancement in
itself.
The most important reason of all may just be to build confidence in
the design. I'm interested in understanding when and how prefetching
stops helping.
I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.
Definitely.
There are two reasons why I think AM is not the right place:
- accessing table from index code seems backwards
- we already do prefetching from the executor (nodeBitmapHeapscan.c)
It feels kinda wrong in hindsight.
I'm willing to accept that we should do it the way you've done it in
the patch provisionally. It's complicated enough that it feels like I
should reserve the right to change my mind.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.
Right. You probably noticed that this is another case where we'd be
making index scans behave more like bitmap index scans (perhaps even
including the downsides for kill_prior_tuple that accompany not
processing each leaf page inline). There is probably a point where
that ceases to be sensible, but I don't know what that point is.
They're way more similar than we seem to imagine.
--
Peter Geoghegan
Hi,
On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
I'm really excited about this work.
1) pairing-heap in GiST / SP-GiST
For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is pretty trivial, even if the
current API is a bit cumbersome.
Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(
I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.
In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.
I think it'd be perfectly fair to just not tackle distance queries for now.
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...
4) per-leaf prefetching
The code is restricted to only prefetch items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
5) index-only scans
I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access that. And if that happens, this leads to the bizarre situation
that IOS is slower than regular index scan. But to address this, we'd
have to consider the visibility during prefetching.
That should be easy to do, right?
Benchmark / TPC-H
-----------------
I ran the 22 queries on 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):
query serial parallel
1 101% 99%
2 119% 100%
3 100% 99%
4 101% 100%
5 101% 100%
6 12% 99%
7 100% 100%
8 52% 67%
10 102% 101%
11 100% 72%
12 101% 100%
13 100% 101%
14 13% 100%
15 101% 100%
16 99% 99%
17 95% 101%
18 101% 106%
19 30% 40%
20 99% 100%
21 101% 100%
22 101% 107%
The percentage is (timing patched / master, so <100% means faster, >100%
means slower).
The different queries are affected depending on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.
My explanation is that either (a) parallel case used a different plan
with fewer index scans or (b) the parallel query does more concurrent
I/O simply by using parallel workers. Or maybe both.
There are a couple regressions too, I believe those are due to doing too
much prefetching in some cases, and some of the heuristics mentioned
earlier should eliminate most of this, I think.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching gains in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
Greetings,
Andres Freund
On Thu, Jun 8, 2023 at 4:38 PM Peter Geoghegan <pg@bowt.ie> wrote:
This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.
I'll now give a simpler (though less realistic) example of a case
where "mini bitmap index scan" would be expected to help index scans
in general, and prefetching during index scans in particular.
Something very simple:
create table bitmap_parity_test(randkey int4, filler text);
create index on bitmap_parity_test (randkey);
insert into bitmap_parity_test select (random()*1000),
repeat('filler',10) from generate_series(1,250) i;
This gives me a table with 4 pages, and an index with 2 pages.
The following query selects about half of the rows from the table:
select * from bitmap_parity_test where randkey < 500;
If I force the query to use a bitmap index scan, I see that the total
number of buffers hit is exactly as expected (according to
EXPLAIN(ANALYZE,BUFFERS), that is): there are 5 buffers/pages hit. We
need to access every single heap page once, and we need to access the
only leaf page in the index once.
I'm sure that you know where I'm going with this already. I'll force
the same query to use a plain index scan, and get a very different
result. Now EXPLAIN(ANALYZE,BUFFERS) shows that there are a total of
89 buffers hit -- 88 of which must just be the same 5 heap pages,
again and again. That's just silly. It's probably not all that much
slower, but it's not helping things. And it's likely that this effect
interferes with the prefetching in your patch.
Obviously you can come up with a variant of this test case where
bitmap index scan does way fewer buffer accesses in a way that really
makes sense -- that's not in question. This is a fairly selective
index scan, since it only touches one index page -- and yet we still
see this difference.
(Anybody pedantic enough to want to dispute whether or not this index
scan counts as "selective" should run "insert into bitmap_parity_test
select i, repeat('actshually',10) from generate_series(2000,1e5) i"
before running the "randkey < 500" query, which will make the index
much larger without changing any of the details of how the query pins
pages -- non-pedants should just skip that step.)
--
Peter Geoghegan
On 6/9/23 02:06, Andres Freund wrote:
Hi,
On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
I'm really excited about this work.
1) pairing-heap in GiST / SP-GiST
For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is pretty trivial, even if the
current API is a bit cumbersome.
Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(
I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.
In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.
I think it'd be perfectly fair to just not tackle distance queries for now.
My concern is that if we cut this from v0 entirely, we'll end up with an
API that'll not be suitable for adding distance queries later.
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...
Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?
4) per-leaf prefetching
The code is restricted to only prefetch items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).
5) index-only scans
I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access that. And if that happens, this leads to the bizarre situation
that IOS is slower than regular index scan. But to address this, we'd
have to consider the visibility during prefetching.
That should be easy to do, right?
It doesn't seem particularly complicated (famous last words), and we
need to do the VM checks anyway so it seems like it wouldn't add a lot
of overhead either.
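For illustration, a rough sketch (not actual patch code) of what such a
check could look like - only issue the prefetch when the heap page is not
all-visible, reusing a VM buffer the way index-only scans already do:

#include "postgres.h"
#include "access/genam.h"
#include "access/relscan.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical helper: prefetch a heap block for an index-only scan only
 * when it is not all-visible, since the IOS won't read all-visible pages
 * from the heap anyway. The caller is assumed to track vmbuffer the same
 * way nodeIndexonlyscan.c does for its visibility checks.
 */
static void
ios_prefetch_block(IndexScanDesc scan, BlockNumber block, Buffer *vmbuffer)
{
    /* all-visible heap pages won't be fetched by the IOS, so skip them */
    if (VM_ALL_VISIBLE(scan->heapRelation, block, vmbuffer))
        return;

    PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
}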
Benchmark / TPC-H
-----------------
I ran the 22 queries on a 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):
query serial parallel
1 101% 99%
2 119% 100%
3 100% 99%
4 101% 100%
5 101% 100%
6 12% 99%
7 100% 100%
8 52% 67%
10 102% 101%
11 100% 72%
12 101% 100%
13 100% 101%
14 13% 100%
15 101% 100%
16 99% 99%
17 95% 101%
18 101% 106%
19 30% 40%
20 99% 100%
21 101% 100%
22 101% 107%
The percentage is (timing patched / timing master), so <100% means faster
and >100% means slower.
The different queries are affected depending on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.
My explanation is that either (a) the parallel case used a different plan
with fewer index scans or (b) the parallel query does more concurrent
I/O simply by using parallel workers. Or maybe both.
There are a couple regressions too, I believe those are due to doing too
much prefetching in some cases, and some of the heuristics mentioned
earlier should eliminate most of this, I think.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:
for q in `seq 1 22`; do
1. drop caches and restart postgres
2. run query $q -> uncached
3. run query $q -> cached
done
So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.
I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
cached" case.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/9/23 01:38, Peter Geoghegan wrote:
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.
So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.
Did you mean parallel index scan or bitmap index scan?
But yeah, I get the point that SAOP queries are an interesting example
of queries to explore. I'll add some to the next round of tests.
Even if SAOP (probably) wasn't the reason, I think you're right it may
be an issue for prefetching, causing regressions. It didn't occur to me
before, because I'm not that familiar with the btree code and/or how it
deals with SAOP (and didn't really intend to study it too deeply).
I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
page" in many important cases. Not really -- not in the sense that
you'd hope and expect. We're senselessly processing the same index
leaf page multiple times and treating it as a different, independent
leaf page. That makes heap prefetching of the kind you're working on
utterly hopeless, since it effectively throws away lots of useful
context. Obviously that's the fault of nbtree ScalarArrayOpExpr
handling, not the fault of your patch.
I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why it would make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?
I guess I need to do some testing on a range of data sets / queries, and
see how it works in practice.
So if you're planning to work on this for PG17, collaborating on it
would be great.
For now I plan to just ignore SAOP, or maybe just disable prefetching
for SAOP index scans if it proves to be prone to regressions. That's not
great, but at least it won't make matters worse.
Makes sense, but I hope that it won't come to that.
IMV it's actually quite reasonable that you didn't expect to have to
think about ScalarArrayOpExpr at all -- it would make a lot of sense
if that was already true. But the fact is that it works in a way
that's pretty silly and naive right now, which will impact
prefetching. I wasn't really thinking about regressions, though. I was
actually more concerned about missing opportunities to get the most
out of prefetching. ScalarArrayOpExpr really matters here.
OK
I guess something like this might be a "nice" bad case:
insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i);
select * from btree_test where a in (999, 1000, 1001, 1002);
The values are likely colocated on the same heap page, so the bitmap scan
is going to do a single prefetch. With an index scan we'll prefetch them
repeatedly. I'll give it a try.
This is the sort of thing that I was thinking of. What are the
conditions under which bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of laws-of-physics level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?
Yeah, that's partially why I do this kind of testing on a wide range of
synthetic data sets - to find cases that behave in an unexpected way (say,
seem like they should improve but don't).
I see that your patch has logic that avoids repeated prefetching of
the same block -- plus you have comments that wonder about going
further by adding a "small lru array" in your new index_prefetch()
function. I asked you about this during the unconference presentation.
But I think that my understanding of the situation was slightly
different to yours. That's relevant here.
I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
I think that's possible, and I thought about that a bit (not just for
btree, but especially for the distance queries on GiST). But I don't
have a good idea if this would be 1% or 50% improvement, and I was
concerned it might easily lead to regressions if we don't actually need
all the tuples.
I mean, imagine we have TIDs
[T1, T2, T3, T4, T5, T6]
Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:
[T1, T5, T6, T2, T3, T4]
But maybe we only need [T1, T2] because of a LIMIT, and the extra work
we did on processing T5, T6 is wasted.
This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.
I'm talking about problems that exist today, without your patch.
I'll show a concrete example of the kind of index/index scan that
might be affected.
Attached is an extract of the server log when the regression tests ran
against a server patched to show custom instrumentation. The log
output shows exactly what's going on with one particular nbtree
opportunistic deletion (my point has nothing to do with deletion, but
it happens to be convenient to make my point in this fashion). This
specific example involves deletion of tuples from the system catalog
index "pg_type_typname_nsp_index". There is nothing very atypical
about it; it just shows a certain kind of heap fragmentation that's
probably very common.
Imagine a plain index scan involving a query along the lines of
"select * from pg_type where typname like 'part%' ", or similar. This
query runs an instant before the example LP_DEAD-bit-driven
opportunistic deletion (a "simple deletion" in nbtree parlance) took
place. You'll be able to piece together from the log output that there
would only be about 4 heap blocks involved with such a query. Ideally,
our hypothetical index scan would pin each buffer/heap page exactly
once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
we're talking about a fairly selective query here, that only needs to
scan precisely one leaf page (I verified this part too) -- so why
wouldn't we expect "index scan parity"?
While there is significant clustering on this example leaf page/key
space, heap TID is not *perfectly* correlated with the
logical/keyspace order of the index -- which can have outsized
consequences. Notice that some heap blocks are non-contiguous
relative to logical/keyspace/index scan/index page offset number order.
We'll end up pinning each of the 4 or so heap pages more than once
(sometimes several times each), when in principle we could have pinned
each heap page exactly once. In other words, there is way too much of
a difference between the case where the tuples we scan are *almost*
perfectly clustered (which is what you see in my example) and the case
where they're exactly perfectly clustered. In other other words, there
is way too much of a difference between plain index scan and bitmap
index scan.
(What I'm saying here is only true because this is a composite index
and our query uses "like", returning rows that match a prefix -- if our
index was on the column "typname" alone and we used a simple equality
condition in our query then the Postgres 12 nbtree work would be
enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
that there are still relatively many important cases where we perform
extra PinBuffer()/UnpinBuffer() calls during plain index scans that
only touch one leaf page anyway.)
Obviously we should expect bitmap index scans to have a natural
advantage over plain index scans whenever there is little or no
correlation -- that's clear. But that's not what we see here -- we're
way too sensitive to minor imperfections in clustering that are
naturally present on some kinds of leaf pages. The potential
difference in pin/unpin traffic (relative to the bitmap index scan
case) seems pathological to me. Ideally, we wouldn't have these kinds
of differences at all. It's going to disrupt usage_count on the
buffers.
I'm not sure I understand all the nuance here, but the thing I take away
is to add tests with different levels of correlation, and probably also
some multi-column indexes.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
I do agree, but what do you mean by "assessing"?
I mean performance validation. There ought to be a theoretical model
that describes the relationship between index scan and bitmap index
scan, that has actual predictive power in the real world, across a
variety of different cases. Something that isn't sensitive to the
current phase of the moon (e.g., heap fragmentation along the lines of
my pg_type_typname_nsp_index log output). I particularly want to avoid
nasty discontinuities that really make no sense.
Wasn't the agreement at the unconference session that we'd not tweak
costing? So ultimately, this
does not really affect which scan type we pick. We'll keep doing the
same planning decisions as today, no?
I'm not really talking about tweaking the costing. What I'm saying is
that we really should expect index scans to behave similarly to bitmap
index scans at runtime, for queries that really don't have much to
gain from using a bitmap heap scan (queries that may or may not also
benefit from prefetching). There are several reasons why this makes
sense to me.
One reason is that it makes tweaking the actual costing easier later
on. Also, your point about plan robustness was a good one. If we make
the wrong choice about index scan vs bitmap index scan, and the
consequences aren't so bad, that's a very useful enhancement in
itself.
The most important reason of all may just be to build confidence in
the design. I'm interested in understanding when and how prefetching
stops helping.
Agreed.
I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.
Definitely.
There are two aspects why I think AM is not the right place:
- accessing table from index code seems backwards
- we already do prefetching from the executor (nodeBitmapHeapscan.c)
It feels kinda wrong in hindsight.
I'm willing to accept that we should do it the way you've done it in
the patch provisionally. It's complicated enough that it feels like I
should reserve the right to change my mind.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.
Right. You probably noticed that this is another case where we'd be
making index scans behave more like bitmap index scans (perhaps even
including the downsides for kill_prior_tuple that accompany not
processing each leaf page inline). There is probably a point where
that ceases to be sensible, but I don't know what that point is.
They're way more similar than we seem to imagine.
OK. Thanks for all the comments.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 9, 2023 at 3:45 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.
Did you mean parallel index scan or bitmap index scan?
I meant parallel index scan (also parallel bitmap index scan). Note
that nbtree parallel index scans have special ScalarArrayOpExpr
handling code.
ScalarArrayOpExpr is kind of special -- it is simultaneously one big
index scan (to the executor), and lots of small index scans (to
nbtree). Unlike the queries that you've looked at so far, which really
only have one plausible behavior at execution time, there are many
ways that ScalarArrayOpExpr index scans can be executed at runtime --
some much faster than others. The nbtree implementation can in
principle reorder how it processes ranges from the key space (i.e.
each range of array elements) with significant flexibility.
I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why it would make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?
Hopeless might have been too strong of a word. More like it'd fall far
short of what is possible to do with a ScalarArrayOpExpr with a given
high end server.
The quality of the implementation (including prefetching) could make a
huge difference to how well we make use of the available hardware
resources. A really high quality implementation of ScalarArrayOpExpr +
prefetching can keep the system busy with useful work, which is less
true with other types of queries, which have inherently less
predictable I/O (and often have less I/O overall). What could be more
amenable to predicting I/O patterns than a query with a large IN()
list, with many constants that can be processed in whatever order
makes sense at runtime?
What I'd like to do with ScalarArrayOpExpr is to teach nbtree to
coalesce together those "small index scans" into "medium index scans"
dynamically, where that makes sense. That's the main part that's
missing right now. Dynamic behavior matters a lot with
ScalarArrayOpExpr stuff -- that's where the challenge lies, but also
where the opportunities are. Prefetching builds on all that.
I guess I need to do some testing on a range of data sets / queries, and
see how it works in practice.
If I can figure out a way of getting ScalarArrayOpExpr to visit each
leaf page exactly once, that might be enough to make things work
really well most of the time. Maybe it won't even be necessary to
coordinate very much, in the end. Unsure.
I've already done a lot of work that tries to minimize the chances of
regular (non-ScalarArrayOpExpr) queries accessing more than a single
leaf page, which will help your strategy of just prefetching items
from a single leaf page at a time -- that will get you pretty far
already. Consider the example of the tenk2_hundred index from the
bt_page_items documentation. You'll notice that the high key for the
page shown in the docs (and every other page in the same index) nicely
makes the leaf page boundaries "aligned" with natural keyspace
boundaries, due to suffix truncation. That helps index scans to access
no more than a single leaf page when accessing any one distinct
"hundred" value.
We are careful to do the right thing with the "boundary cases" when we
descend the tree, too. This _bt_search behavior builds on the way that
suffix truncation influences the on-disk structure of indexes. Queries
such as "select * from tenk2 where hundred = ?" will each return 100
rows spread across almost as many heap pages. That's a fairly large
number of rows/heap pages, but we still only need to access one leaf
page for every possible constant value (every "hundred" value that
might be specified as the ? in my point query example). It doesn't
matter if it's the leftmost or rightmost item on a leaf page -- we
always descend to exactly the correct leaf page directly, and we
always terminate the scan without having to move to the right sibling
page (we check the high key before going to the right page in some
cases, per the optimization added by commit 29b64d1d).
The same kind of behavior is also seen with the TPC-C line items
primary key index, which is a composite index. We want to access the
items from a whole order in one go, from one leaf page -- and we
reliably do the right thing there too (though with some caveats about
CREATE INDEX). We should never have to access more than one leaf page
to read a single order's line items. This matters because it's quite
natural to want to access whole orders with that particular
table/workload (it's also unnatural to only access one single item
from any given order).
Obviously there are many queries that need to access two or more leaf
pages, because that's just what needs to happen. My point is that we
*should* only do that when it's truly necessary on modern Postgres
versions, since the boundaries between pages are "aligned" with the
"natural boundaries" from the keyspace/application. Maybe your testing
should verify that this effect is actually present, though. It would
be a shame if we sometimes messed up prefetching that could have
worked well due to some issue with how page splits divide up items.
CREATE INDEX is much less smart about suffix truncation -- it isn't
capable of the same kind of tricks as nbtsplitloc.c, even though it
could be taught to do roughly the same thing. Hopefully this won't be
an issue for your work. The tenk2 case still works as expected with
CREATE INDEX/REINDEX, due to help from deduplication. Indexes like the
TPC-C line items PK will leave the index with some "orders" (or
whatever the natural grouping of things is) that span more than a
single leaf page, which is undesirable, and might hinder your
prefetching work. I wouldn't mind fixing that if it turned out to hurt
your leaf-page-at-a-time prefetching patch. Something to consider.
We can fit at most 17 TPC-C orders on each order line PK leaf page.
Could be as few as 15. If we do the wrong thing with prefetching for 2
out of every 15 orders then that's a real problem, but is still subtle enough
to easily miss with conventional benchmarking. I've had a lot of success
with paying close attention to all the little boundary cases, which is why
I'm kind of zealous about it now.
I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
I think that's possible, and I thought about that a bit (not just for
btree, but especially for the distance queries on GiST). But I don't
have a good idea if this would be 1% or 50% improvement, and I was
concerned it might easily lead to regressions if we don't actually need
all the tuples.
I get that it could be invasive. I have the sense that just pinning
the same heap page more than once in very close succession is just the
wrong thing to do, with or without prefetching.
I mean, imagine we have TIDs
[T1, T2, T3, T4, T5, T6]
Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:
[T1, T5, T6, T2, T3, T4]
But maybe we only need [T1, T2] because of a LIMIT, and the extra work
we did on processing T5, T6 is wasted.
Yeah, that's possible. But isn't that par for the course? Any
optimization that involves speculation (including all prefetching)
comes with similar risks. They can be managed.
I don't think that we'd literally order by TID...we wouldn't change
the order that each heap page was *initially* pinned. We'd just
reorder the tuples minimally using an approach that is sufficient to
avoid repeated pinning of heap pages during processing of any one leaf
page's heap TIDs. ISTM that the risk of wasting work is limited to
wasting cycles on processing extra tuples from a heap page that we
definitely had to process at least one tuple from already. That
doesn't seem particularly risky, as speculative optimizations go. The
downside is bounded and well understood, while the upside could be
significant.
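For concreteness, a minimal sketch (illustration only, not from the patch)
of the kind of per-leaf-page reordering I have in mind - sort the leaf
page's heap TIDs by block number for the prefetch/heap pass, while
remembering each TID's original position so tuples are still returned in
index order:

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Illustration: each heap TID from the current leaf page, paired with its
 * original position in the leaf's item array.
 */
typedef struct LeafTidSortItem
{
    ItemPointerData tid;        /* heap TID from the leaf page */
    int             origidx;    /* original offset in the leaf's item array */
} LeafTidSortItem;

/* qsort comparator: group TIDs by heap block number */
static int
leaf_tid_cmp(const void *a, const void *b)
{
    const LeafTidSortItem *ia = (const LeafTidSortItem *) a;
    const LeafTidSortItem *ib = (const LeafTidSortItem *) b;
    BlockNumber ba = ItemPointerGetBlockNumber(&ia->tid);
    BlockNumber bb = ItemPointerGetBlockNumber(&ib->tid);

    if (ba < bb)
        return -1;
    if (ba > bb)
        return 1;
    return 0;
}

/*
 * Sketch of use: qsort(items, nitems, sizeof(LeafTidSortItem), leaf_tid_cmp),
 * prefetch + access the heap in this block-grouped order, then walk the
 * array by origidx to emit tuples in the original index order.
 */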
I really don't have that much confidence in any of this just yet. I'm
not trying to make this project more difficult. I just can't help but
notice that the order that index scans end up pinning heap pages
already has significant problems, and is sensitive to things like
small amounts of heap fragmentation -- maybe that's not a great basis
for prefetching. I *really* hate any kind of sharp discontinuity,
where a minor change in an input (e.g., from minor amounts of heap
fragmentation) has outsized impact on an output (e.g., buffers
pinned). Interactions like that tend to be really pernicious -- they
lead to bad performance that goes unnoticed and unfixed because the
problem effectively camouflages itself. It may even be easier to make
the conservative (perhaps paranoid) assumption that weird nasty
interactions will cause harm somewhere down the line...why take a
chance?
I might end up prototyping this myself. I may have to put my money
where my mouth is. :-)
--
Peter Geoghegan
On Thu, Jun 8, 2023 at 11:40 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans
At the point Greg Stark was hacking on this, the underlying OS async I/O
features were tricky to fit into PG's I/O model, and both of us did much
review work just to find working common ground that PG could plug into.
Linux POSIX advisories were completely different from Solaris's async
model, the other OS used for validation that the feature worked, with the
hope being that designing against two APIs would be better than just
focusing on Linux. Since that foundation was all so brittle and limited,
scope was limited to just the heap scan, since it seemed to have the best
return on time invested given the parts of async I/O that did and didn't
scale as expected.
As I remember it, the idea was to get the basic feature out the door and
gather feedback about things like whether the effective_io_concurrency knob
worked as expected before moving onto other prefetching. Then that got
lost in filesystem upheaval land, with so much drama around Solaris/ZFS and
Oracle's btrfs work. I think it's just that no one ever got back to it.
I have all the workloads that I use for testing automated into
pgbench-tools now, and this change would be easy to fit into testing on
them as I'm very heavy on block I/O tests. To get PG to reach full read
speed on newer storage I've had to do some strange tests, like doing index
range scans that touch 25+ pages. Here's that one as a pgbench script:
\set range 67 * (:multiplier + 1)
\set limit 100000 * :scale
\set limit :limit - :range
\set aid random(1, :limit)
SELECT aid,abalance FROM pgbench_accounts WHERE aid >= :aid ORDER BY aid
LIMIT :range;
And then you use '-Dmultiplier=10' or such to crank it up. Database 4X
RAM, multiplier=25 with 16 clients is my starting point on it when I want
to saturate storage. Anything that lets me bring those numbers down would
be valuable.
--
Greg Smith greg.smith@crunchydata.com
Director of Open Source Strategy
Hi,
On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...
Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?
Yes, I meant that.
4) per-leaf prefetching
The code only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).
I'll try to have a look.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:
for q in `seq 1 22`; do
1. drop caches and restart postgres
Are you doing it in that order? If so, the pagecache can end up being seeded
by postgres writing out dirty buffers.
2. run query $q -> uncached
3. run query $q -> cached
done
So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.
I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.
Ah, ok.
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
cached" case.
Cool. It'd be good to measure both the case of all data already being in s_b
(to see the overhead of the buffer mapping lookups) and the case where the
data is in the kernel pagecache (to see the overhead of pointless
posix_fadvise calls).
Greetings,
Andres Freund
On 6/10/23 22:34, Andres Freund wrote:
Hi,
On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...
Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?
Yes, I meant that.
Ah, you meant that maybe we shouldn't have done that. Sorry, I
misunderstood.
4) per-leaf prefetching
The code only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).
I'll try to have a look.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:
for q in `seq 1 22`; do
1. drop caches and restart postgres
Are you doing it in that order? If so, the pagecache can end up being seeded
by postgres writing out dirty buffers.
Actually no, I do it the other way around - first restart, then drop. It
shouldn't matter much, though, because after building the data set (and
vacuum + checkpoint), the data is not modified - all the queries run on
the same data set. So there shouldn't be any dirty buffers.
2. run query $q -> uncached
3. run query $q -> cached
done
So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.
I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.
Ah, ok.
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
"cached" case.
Cool. It'd be good to measure both the case of all data already being in s_b
(to see the overhead of the buffer mapping lookups) and the case where the
data is in the kernel pagecache (to see the overhead of pointless
posix_fadvise calls).
OK, I'll make sure the next round of tests includes a sufficiently small
data set too. I should have some numbers sometime early next week.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2023-06-08 at 17:40 +0200, Tomas Vondra wrote:
Hi,
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
I added an entry to
https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference
based on notes I took during that session.
Hope it helps.
--
Tomasz Rybak, Debian Developer <serpent@debian.org>
GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C
On Thu, Jun 8, 2023 at 9:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
One of the reasons IMHO is that in the bitmap scan before starting the
heap fetch TIDs are already sorted in heap block order. So it is
quite obvious that once we prefetch a heap block most of the
subsequent TIDs will fall on that block i.e. each prefetch will
satisfy many immediate requests. OTOH, in the index scan the I/O
request is very random so we might have to prefetch many blocks even
for satisfying the request for TIDs falling on one index page. I
agree that prefetching with an index scan will definitely help in
reducing the random I/O, but my guess is that prefetching with a
bitmap scan appeared more natural, and that would have been one of
the reasons for implementing this only for a bitmap scan.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
I have results from the new extended round of prefetch tests. I've
pushed everything to
https://github.com/tvondra/index-prefetch-tests-2
There are scripts I used to run this (run-*.sh), raw results and various
kinds of processed summaries (pdf, ods, ...) that I'll mention later.
As before, this tests a number of query types:
- point queries with btree and hash (equality)
- ORDER BY queries with btree (inequality + order by)
- SAOP queries with btree (column IN (values))
It's probably futile to go through details of all the tests - it's
easier to go through the (hopefully fairly readable) shell scripts.
But in principle, it runs some simple queries while varying both the data
set and workload:
- data set may be random, sequential or cyclic (with different length)
- the number of matches per value differs (i.e. equality condition may
match 1, 10, 100, ..., 100k rows)
- forces a particular scan type (indexscan, bitmapscan, seqscan)
- each query is executed twice - first run (right after restarting DB
and dropping caches) is uncached, second run should have data cached
- the query is executed 5x with different parameters (so 10x in total)
This is tested with three basic data sizes - fits into shared buffers,
fits into RAM and exceeds RAM. The sizes are roughly 350MB, 3.5GB and
20GB (i5) / 40GB (xeon).
Note: xeon has 64GB RAM, so technically the largest scale fits into RAM.
But it should not matter, thanks to drop-caches and restart.
I also attempted to pin the backend to a particular core, in an effort to
eliminate scheduling-related noise. It's mostly what taskset does, but I
did that from extension (https://github.com/tvondra/taskset) which
allows me to do that as part of the SQL script.
For the results, I'll talk about the v1 patch (as submitted here) first.
I'll use the PDF results in the "pdf" directory which generally show a
pivot table by different test parameters, comparing the results by
different parameters (prefetching on/off, master/patched).
Feel free to do your own analysis from the raw CSV data, ofc.
For example, this:
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-point-queries-builds.pdf
shows how the prefetching affects timing for point queries with
different numbers of matches (1 to 100k). The numbers are timings for
master and patched build. The last group is (patched/master), so the
lower the number the better - 50% means patch makes the query 2x faster.
There's also a heatmap, with green=good, red=bad, which makes it easier
to spot cases that got slower/faster.
The really interesting stuff starts on page 7 (in this PDF), because the
first couple pages are "cached" (so it's more about measuring overhead
when prefetching has no benefit).
Right on page 7 you can see a couple cases with a mix of slower/faster
cases, roughly in the +/- 30% range. However, this is unrelated to
the patch because those are results for bitmapheapscan.
For indexscans (page 8), the results are invariably improved - the more
matches the better (up to ~10x faster for 100k matches).
Those were results for the "cyclic" data set. For the random data set (pages
9-11) the results are pretty similar, but for "sequential" data (11-13)
the prefetching is actually harmful - there are red clusters, with up to
500% slowdowns.
I'm not going to explain the summary for SAOP queries
(https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-saop-queries-builds.pdf),
the story is roughly the same, except that there are more tested query
combinations (because we also vary the pattern in the IN() list - number
of values etc.).
So, the conclusion from this is - generally very good results for random
and cyclic data sets, but pretty bad results for sequential. But even
for the random/cyclic cases there are combinations (especially with many
matches) where prefetching doesn't help or even hurts.
The only way to deal with this is (I think) a cheap way to identify and
skip inefficient prefetches, essentially by doing two things:
a) remembering more recently prefetched blocks (say, 1000+) and not
prefetching them over and over
b) ability to identify a sequential pattern, when readahead seems to do
a pretty good job already (although I heard some disagreement)
I've been thinking about how to do this - doing (a) seems pretty hard,
because on the one hand we want to remember a fair number of blocks and
we want the check "did we prefetch X" to be very cheap. So a hash table
seems nice. OTOH we want to expire "old" blocks and only keep the most
recent ones, and a hash table doesn't really support that.
Perhaps there is a great data structure for this, not sure. But after
thinking about this I realized we don't need perfect accuracy - it's
fine to have false positives/negatives - it's fine to forget we already
prefetched block X and prefetch it again. It's not
a matter of correctness, just a matter of efficiency - after all, we
can't know if it's still in memory, we only know if we prefetched it
fairly recently.
This led me to a "hash table of LRU caches" thing. Imagine a tiny LRU
cache that's small enough to be searched linearly (say, 8 blocks). And
we have many of them (e.g. 128), so that in total we can remember 1024
block numbers. Now, every block number is mapped to a single LRU by
hashing, as if we had a hash table
index = hash(blockno) % 128
and we only use that one LRU to track this block. It's tiny so we can
search it linearly.
To expire prefetched blocks, there's a counter incremented every time we
prefetch a block, and we store it in the LRU with the block number. When
checking the LRU we ignore old entries (with counter more than 1000
values back), and we also evict/replace the oldest entry if needed.
This seems to work pretty well for the first requirement, but it doesn't
allow identifying the sequential pattern cheaply. To do that, I added a
tiny queue with a couple of entries that can be checked to see if the last
couple of entries are sequential.
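To make the scheme concrete, here's a simplified standalone sketch
(illustration only - it simplifies things, e.g. by using a modulo instead
of hashing the block number, and it omits the sequential-pattern queue):

#include <stdbool.h>
#include <stdint.h>

#define LRU_COUNT   128                         /* number of tiny LRUs */
#define LRU_SIZE    8                           /* entries per tiny LRU */
#define CACHE_SIZE  (LRU_COUNT * LRU_SIZE)      /* "recent" horizon: 1024 requests */

typedef struct CacheEntry
{
    uint32_t    block;      /* block number */
    uint64_t    request;    /* request counter at last prefetch, 0 = unused */
} CacheEntry;

static CacheEntry cache[LRU_COUNT * LRU_SIZE];
static uint64_t reqno;      /* incremented for every prefetch request */

/* returns true if the block was prefetched within the last CACHE_SIZE requests */
static bool
recently_prefetched(uint32_t block)
{
    /* pick the tiny LRU for this block */
    CacheEntry *lru = &cache[(block % LRU_COUNT) * LRU_SIZE];
    int         victim = 0;

    for (int i = 0; i < LRU_SIZE; i++)
    {
        /* track the oldest (or unused) slot, in case we need to evict */
        if (lru[i].request < lru[victim].request)
            victim = i;

        if (lru[i].request != 0 && lru[i].block == block)
        {
            /* found it - was the previous request recent enough? */
            bool        recent = (lru[i].request + CACHE_SIZE >= reqno);

            lru[i].request = ++reqno;   /* refresh the entry */
            return recent;
        }
    }

    /* not found - remember the block in the oldest/unused slot */
    lru[victim].block = block;
    lru[victim].request = ++reqno;
    return false;
}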
And this is what the attached 0002+0003 patches do. There are PDFs with
results for this build, prefixed with "patch-v3", and the results are
pretty good - the regressions are largely gone.
It's even clearer in the PDFs comparing the impact of the two patches:
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-point.pdf
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-saop.pdf
Which simply shows the "speedup heatmap" for the two patches, and the
"v3" heatmap has much less red regression clusters.
Note: The comparison-point.pdf summary has another group of columns
illustrating if this scan type would actually be used, with "green"
meaning "yes". This provides additional context, because e.g. for the
"noisy bitmapscans" it's all white, i.e. without setting the GUCs the
optimizer would pick something else (hence it's a non-issue).
Let me know if the results are not clear enough (I tried to cover the
important stuff, but I'm sure there's a lot of details I didn't cover),
or if you think some other summary would be better.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0003-ignore-seq-patterns-add-stats-v3.patch
From fc869af55678eda29045190f735da98c4b6808d9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 15 Jun 2023 14:49:56 +0200
Subject: [PATCH 2/2] ignore seq patterns, add stats
---
src/backend/access/index/indexam.c | 80 ++++++++++++++++++++++++++++++
src/include/access/genam.h | 16 ++++++
2 files changed, 96 insertions(+)
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 557267aced9..6ab977ca284 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -378,6 +378,16 @@ index_endscan(IndexScanDesc scan)
if (scan->xs_temp_snap)
UnregisterSnapshot(scan->xs_snapshot);
+ /* If prefetching enabled, log prefetch stats. */
+ if (scan->xs_prefetch)
+ {
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ elog(LOG, "index prefetch stats: requests %lu prefetches %lu (%f)",
+ prefetch->prefetchAll, prefetch->prefetchCount,
+ prefetch->prefetchCount * 100.0 / prefetch->prefetchAll);
+ }
+
/* Release the scan data structure itself */
IndexScanEnd(scan);
}
@@ -1028,6 +1038,57 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+/*
+ * Add the block to the tiny top-level queue (LRU), and check if the block
+ * is in a sequential pattern.
+ */
+static bool
+index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
+{
+ bool is_sequential = true;
+ int idx;
+
+ /* no requests */
+ if (prefetch->queueIndex == 0)
+ {
+ idx = (prefetch->queueIndex++) % PREFETCH_QUEUE_SIZE;
+ prefetch->queueItems[idx] = block;
+ return false;
+ }
+
+ /* same as immediately preceding block? */
+ idx = (prefetch->queueIndex - 1) % PREFETCH_QUEUE_SIZE;
+ if (prefetch->queueItems[idx] == block)
+ return true;
+
+ idx = (prefetch->queueIndex++) % PREFETCH_QUEUE_SIZE;
+ prefetch->queueItems[idx] = block;
+
+ for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
+ {
+ /* not enough requests */
+ if (prefetch->queueIndex < i)
+ {
+ is_sequential = false;
+ break;
+ }
+
+ /*
+ * -1, because we've already advanced the index, so it points to
+ * the next slot at this point
+ */
+ idx = (prefetch->queueIndex - i - 1) % PREFETCH_QUEUE_SIZE;
+
+ if ((block - i) != prefetch->queueItems[idx])
+ {
+ is_sequential = false;
+ break;
+ }
+ }
+
+ return is_sequential;
+}
+
/*
* index_prefetch_add_cache
* Add a block to the cache, return true if it was recently prefetched.
@@ -1081,6 +1142,19 @@ index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
uint64 oldestRequest = PG_UINT64_MAX;
int oldestIndex = -1;
+ /*
+ * First add the block to the (tiny) top-level LRU cache and see if it's
+ * part of a sequential pattern. In this case we just ignore the block
+ * and don't prefetch it - we expect read-ahead to do a better job.
+ *
+ * XXX Maybe we should still add the block to the later cache, in case
+ * we happen to access it later? That might help if we first scan a lot
+ * of the table sequentially, and then randomly. Not sure that's very
+ * likely with index access, though.
+ */
+ if (index_prefetch_is_sequential(prefetch, block))
+ return true;
+
/* see if we already have prefetched this block (linear search of LRU) */
for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
{
@@ -1206,6 +1280,8 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
if (prefetch->prefetchTarget <= 0)
return;
+ prefetch->prefetchAll++;
+
/*
* XXX I think we don't need to worry about direction here, that's handled
* by how the AMs build the curPos etc. (see nbtsearch.c)
@@ -1256,6 +1332,8 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
if (index_prefetch_add_cache(prefetch, block))
continue;
+ prefetch->prefetchCount++;
+
PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
pgBufferUsage.blks_prefetches++;
}
@@ -1300,6 +1378,8 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
if (index_prefetch_add_cache(prefetch, block))
continue;
+ prefetch->prefetchCount++;
+
PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
pgBufferUsage.blks_prefetches++;
}
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c01c37951ca..526f280a44d 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -276,6 +276,12 @@ typedef struct PrefetchCacheEntry {
#define PREFETCH_LRU_COUNT 128
#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+/*
+ * Used to detect sequential patterns (and disable prefetching).
+ */
+#define PREFETCH_QUEUE_SIZE 8
+#define PREFETCH_SEQ_PATTERN_BLOCKS 4
+
typedef struct IndexPrefetchData
{
/*
@@ -291,6 +297,16 @@ typedef struct IndexPrefetchData
prefetcher_getblock_function get_block;
prefetcher_getrange_function get_range;
+ uint64 prefetchAll;
+ uint64 prefetchCount;
+
+ /*
+ * Tiny queue of most recently prefetched blocks, used first for cheap
+ * checks and also to identify (and ignore) sequential prefetches.
+ */
+ uint64 queueIndex;
+ BlockNumber queueItems[PREFETCH_QUEUE_SIZE];
+
/*
* Cache of recently prefetched blocks, organized as a hash table of
* small LRU caches.
--
2.40.1
0002-more-elaborate-prefetch-cache-v3.patch
From 2fdfbcabb262e2fea38f40465f60441c5f255096 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Wed, 14 Jun 2023 15:08:55 +0200
Subject: [PATCH 1/2] more elaborate prefetch cache
---
src/backend/access/gist/gistscan.c | 3 -
src/backend/access/hash/hash.c | 3 -
src/backend/access/index/indexam.c | 156 +++++++++++++++++++---------
src/backend/access/nbtree/nbtree.c | 3 -
src/backend/access/spgist/spgscan.c | 3 -
src/backend/replication/walsender.c | 2 +
src/include/access/genam.h | 41 ++++++--
7 files changed, 141 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index fdf978eaaad..eaa89ea6c97 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -128,9 +128,6 @@ gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int pr
prefetcher->prefetchMaxTarget = prefetch_maximum;
prefetcher->prefetchReset = prefetch_reset;
- prefetcher->cacheIndex = 0;
- memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
-
/* callbacks */
prefetcher->get_block = gist_prefetch_getblock;
prefetcher->get_range = gist_prefetch_getrange;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 01a25132bce..6546d457899 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -401,9 +401,6 @@ hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int
prefetcher->prefetchMaxTarget = prefetch_maximum;
prefetcher->prefetchReset = prefetch_reset;
- prefetcher->cacheIndex = 0;
- memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
-
/* callbacks */
prefetcher->get_block = _hash_prefetch_getblock;
prefetcher->get_range = _hash_prefetch_getrange;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aa8a14624d8..557267aced9 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -54,6 +54,7 @@
#include "catalog/pg_amproc.h"
#include "catalog/pg_type.h"
#include "commands/defrem.h"
+#include "common/hashfn.h"
#include "nodes/makefuncs.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -1027,7 +1028,110 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+/*
+ * index_prefetch_add_cache
+ * Add a block to the cache, return true if it was recently prefetched.
+ *
+ * When checking a block, we need to check if it was recently prefetched,
+ * where recently means within PREFETCH_CACHE_SIZE requests. This check
+ * needs to be very cheap, even with fairly large caches (hundreds of
+ * entries). The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also
+ * need to expire entries, so that only "recent" requests are remembered.
+ *
+ * A queue would allow expiring the requests, but checking if a block was
+ * prefetched would be expensive (linear search for longer queues). Another
+ * option would be a hash table, but that has issues with expiring entries
+ * cheaply (which usually degrades the hash table).
+ *
+ * So we use a cache that is organized as multiple small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table), and each LRU is tiny (e.g. 8 entries). The LRU only keeps
+ * the most recent requests (for that particular LRU).
+ *
+ * This allows quick searches and expiration, with false negatives (when
+ * a particular LRU has too many collisions).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * prefetch request in total.
+ *
+ * The recency is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint64, so it should
+ * not wrap (125 zebibytes, would take ~4 million years at 1GB/s).
+ *
+ * To check if a block was prefetched recently, we calculate hash(block),
+ * and then linearly search if the tiny LRU has entry for the same block
+ * and request less than PREFETCH_CACHE_SIZE ago.
+ *
+ * At the same time, we either update the entry (for the same block) if
+ * found, or replace the oldest/empty entry.
+ *
+ * If the block was not recently prefetched (i.e. we want to prefetch it),
+ * we increment the counter.
+ */
+static bool
+index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
+{
+ PrefetchCacheEntry *entry;
+
+ /* calculate which LRU to use */
+ int lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+ /* entry to (maybe) use for this block request */
+ uint64 oldestRequest = PG_UINT64_MAX;
+ int oldestIndex = -1;
+
+ /* see if we already have prefetched this block (linear search of LRU) */
+ for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+ {
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+ /* Is this the oldest prefetch request in this LRU? */
+ if (entry->request < oldestRequest)
+ {
+ oldestRequest = entry->request;
+ oldestIndex = i;
+ }
+
+ /* Request numbers are positive, so 0 means "unused". */
+ if (entry->request == 0)
+ continue;
+
+ /* Is this entry for the same block as the current request? */
+ if (entry->block == block)
+ {
+ bool prefetched;
+
+ /*
+ * Is the old request sufficiently recent? If yes, we treat the
+ * block as already prefetched.
+ *
+ * XXX We do add the cache size to the request in order not to
+ * have issues with uint64 underflows.
+ */
+ prefetched = (entry->request + PREFETCH_CACHE_SIZE >= prefetch->prefetchReqNumber);
+
+ /* Update the request number. */
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ return prefetched;
+ }
+ }
+
+ /*
+ * We didn't find the block in the LRU, so store it either in an empty
+ * entry, or in the "oldest" prefetch request in this LRU.
+ */
+ Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
+
+ entry->block = block;
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ /* not in the prefetch cache */
+ return false;
+}
/*
* Do prefetching, and gradually increase the prefetch distance.
@@ -1138,7 +1242,6 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
for (int i = startIndex; i <= endIndex; i++)
{
- bool recently_prefetched = false;
BlockNumber block;
block = prefetch->get_block(scan, dir, i);
@@ -1149,35 +1252,12 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
* This happens e.g. for clustered or naturally correlated indexes
* (fkey to a sequence ID). It's not expensive (the block is in page
* cache already, so no I/O), but it's not free either.
- *
- * XXX We can't just check blocks between startIndex and endIndex,
- * because at some point (after the pefetch target gets ramped up)
- * it's going to be just a single block.
- *
- * XXX The solution here is pretty trivial - we just check the
- * immediately preceding block. We could check a longer history, or
- * maybe maintain some "already prefetched" struct (small LRU array
- * of last prefetched blocks - say 8 blocks or so - would work fine,
- * I think).
*/
- for (int j = 0; j < 8; j++)
- {
- /* the cached block might be InvalidBlockNumber, but that's fine */
- if (prefetch->cacheBlocks[j] == block)
- {
- recently_prefetched = true;
- break;
- }
- }
-
- if (recently_prefetched)
+ if (index_prefetch_add_cache(prefetch, block))
continue;
PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
pgBufferUsage.blks_prefetches++;
-
- prefetch->cacheBlocks[prefetch->cacheIndex] = block;
- prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
}
prefetch->prefetchIndex = endIndex;
@@ -1206,7 +1286,6 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
for (int i = endIndex; i >= startIndex; i--)
{
- bool recently_prefetched = false;
BlockNumber block;
block = prefetch->get_block(scan, dir, i);
@@ -1217,35 +1296,12 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
* This happens e.g. for clustered or naturally correlated indexes
* (fkey to a sequence ID). It's not expensive (the block is in page
* cache already, so no I/O), but it's not free either.
- *
- * XXX We can't just check blocks between startIndex and endIndex,
- * because at some point (after the pefetch target gets ramped up)
- * it's going to be just a single block.
- *
- * XXX The solution here is pretty trivial - we just check the
- * immediately preceding block. We could check a longer history, or
- * maybe maintain some "already prefetched" struct (small LRU array
- * of last prefetched blocks - say 8 blocks or so - would work fine,
- * I think).
*/
- for (int j = 0; j < 8; j++)
- {
- /* the cached block might be InvalidBlockNumber, but that's fine */
- if (prefetch->cacheBlocks[j] == block)
- {
- recently_prefetched = true;
- break;
- }
- }
-
- if (recently_prefetched)
+ if (index_prefetch_add_cache(prefetch, block))
continue;
PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
pgBufferUsage.blks_prefetches++;
-
- prefetch->cacheBlocks[prefetch->cacheIndex] = block;
- prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
}
prefetch->prefetchIndex = startIndex;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b1a02cc9bcd..1ad5490b9ad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -387,9 +387,6 @@ btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int pr
prefetcher->prefetchMaxTarget = prefetch_maximum;
prefetcher->prefetchReset = prefetch_reset;
- prefetcher->cacheIndex = 0;
- memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
-
/* callbacks */
prefetcher->get_block = _bt_prefetch_getblock;
prefetcher->get_range = _bt_prefetch_getrange;
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 79015194b73..a1c6bb7b139 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -394,9 +394,6 @@ spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int p
prefetcher->prefetchMaxTarget = prefetch_maximum;
prefetcher->prefetchReset = prefetch_reset;
- prefetcher->cacheIndex = 0;
- memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
-
/* callbacks */
prefetcher->get_block = spgist_prefetch_getblock;
prefetcher->get_range = spgist_prefetch_getrange;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d3a136b6f55..c7248877f6c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1131,6 +1131,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
need_full_snapshot = true;
}
+ elog(LOG, "slot = %s need_full_snapshot = %d", cmd->slotname, need_full_snapshot);
+
ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
InvalidXLogRecPtr,
XL_ROUTINE(.page_read = logical_read_xlog_page,
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 6a500c5aa1f..c01c37951ca 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -250,6 +250,32 @@ typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
ScanDirection direction,
int index);
+/*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches. Doesn't need to be perfectly accurate, but we
+ * aim to make false positives/negatives reasonably low.
+ */
+typedef struct PrefetchCacheEntry {
+ BlockNumber block;
+ uint64 request;
+} PrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - it shouldn't be too
+ * small or too large. 1024 entries seems about right; at 8kB blocks it
+ * covers ~8MB of data. The value is somewhat arbitrary - there's no
+ * particular formula saying it should not be higher/lower.
+ *
+ * The cache is structured as an array of small LRU caches, so the total
+ * size needs to be a multiple of LRU size. The LRU should be tiny to
+ * keep linear search cheap enough.
+ *
+ * XXX Maybe we could consider effective_cache_size or something?
+ */
+#define PREFETCH_LRU_SIZE 8
+#define PREFETCH_LRU_COUNT 128
+#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
typedef struct IndexPrefetchData
{
/*
@@ -262,17 +288,16 @@ typedef struct IndexPrefetchData
int prefetchMaxTarget; /* maximum prefetching distance */
int prefetchReset; /* reset to this distance on rescan */
- /*
- * a small LRU cache of recently prefetched blocks
- *
- * XXX needs to be tiny, to make the (frequent) searches very cheap
- */
- BlockNumber cacheBlocks[8];
- int cacheIndex;
-
prefetcher_getblock_function get_block;
prefetcher_getrange_function get_range;
+ /*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches.
+ */
+ uint64 prefetchReqNumber;
+ PrefetchCacheEntry prefetchCache[PREFETCH_CACHE_SIZE];
+
} IndexPrefetchData;
#endif /* GENAM_H */
--
2.40.1
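
To make the "hash table of tiny LRUs" idea above easier to follow outside the
patch context, here is a small standalone sketch of the same dedup cache. It's
only an illustration, not code from the patch - the hash is a crude stand-in
for hash_uint32(), and the names (LRU_SIZE, LRU_COUNT, cache_lookup_add, ...)
are made up to keep it self-contained:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LRU_SIZE    8       /* entries per mini-LRU (searched linearly) */
#define LRU_COUNT   128     /* number of mini-LRUs */
#define CACHE_SIZE  (LRU_SIZE * LRU_COUNT)  /* "recent" horizon, in requests */

typedef struct
{
    uint32_t    block;      /* block number */
    uint64_t    request;    /* request counter value, 0 = unused */
} CacheEntry;

typedef struct
{
    uint64_t    reqno;      /* global request counter */
    CacheEntry  cache[CACHE_SIZE];
} PrefetchCache;

/*
 * Returns true if "block" was requested within the last CACHE_SIZE requests
 * (so the caller can skip the prefetch), false otherwise. Either way the
 * block gets recorded with the current request number.
 */
static bool
cache_lookup_add(PrefetchCache *pc, uint32_t block)
{
    /* crude multiplicative hash, standing in for hash_uint32() */
    int         lru = (int) ((block * 2654435761u) % LRU_COUNT);
    CacheEntry *slots = &pc->cache[lru * LRU_SIZE];
    uint64_t    oldest = UINT64_MAX;
    int         victim = 0;

    for (int i = 0; i < LRU_SIZE; i++)
    {
        /* remember the oldest (or an unused) slot as replacement victim */
        if (slots[i].request < oldest)
        {
            oldest = slots[i].request;
            victim = i;
        }

        /* same block seen before? */
        if (slots[i].request != 0 && slots[i].block == block)
        {
            bool        recent = (slots[i].request + CACHE_SIZE >= pc->reqno);

            slots[i].request = ++pc->reqno;
            return recent;
        }
    }

    /* not found - overwrite the victim slot */
    slots[victim].block = block;
    slots[victim].request = ++pc->reqno;
    return false;
}

int
main(void)
{
    PrefetchCache pc;
    uint32_t    blocks[] = {10, 10, 11, 10, 12, 11};

    memset(&pc, 0, sizeof(pc));

    for (int i = 0; i < 6; i++)
        printf("block %u -> %s\n", (unsigned) blocks[i],
               cache_lookup_add(&pc, blocks[i]) ? "skip (recent)" : "prefetch");

    return 0;
}

The reason for the two-level structure is that the hash picks the right bucket
in O(1), and the linear search then only ever touches LRU_SIZE (8) entries,
which should be cheap enough to do for every TID considered for prefetching.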
Attachment: 0001-index-prefetch-poc-v1.patch (text/x-patch)
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index efdf9415d15..9b3625d833b 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -193,7 +193,7 @@ extern bool blinsert(Relation index, Datum *values, bool *isnull,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys, int prefetch, int prefetch_reset);
extern int64 blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void blrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 6cc7d07164a..0c6da1b635b 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -25,7 +25,7 @@
* Begin scan of bloom index.
*/
IndexScanDesc
-blbeginscan(Relation r, int nkeys, int norderbys)
+blbeginscan(Relation r, int nkeys, int norderbys, int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
BloomScanOpaque so;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..5b298c02cce 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -324,7 +324,7 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
* holding lock on index, it's not necessary to recompute it during brinrescan.
*/
IndexScanDesc
-brinbeginscan(Relation r, int nkeys, int norderbys)
+brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
BrinOpaque *opaque;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index ae7b0e9bb87..3087a986bc3 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -22,7 +22,7 @@
IndexScanDesc
-ginbeginscan(Relation rel, int nkeys, int norderbys)
+ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
GinScanOpaque so;
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069c..7b79128f2ce 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -493,12 +493,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
if (GistPageIsLeaf(page))
{
+ BlockNumber block = ItemPointerGetBlockNumber(&it->t_tid);
+
/* Creating heap-tuple GISTSearchItem */
item->blkno = InvalidBlockNumber;
item->data.heap.heapPtr = it->t_tid;
item->data.heap.recheck = recheck;
item->data.heap.recheckDistances = recheck_distances;
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+
/*
* In an index-only scan, also fetch the data from the tuple.
*/
@@ -529,6 +533,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
UnlockReleaseBuffer(buffer);
+
+ so->didReset = true;
}
/*
@@ -679,6 +685,8 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
so->curPageData++;
+ index_prefetch(scan, ForwardScanDirection);
+
return true;
}
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 00400583c0b..fdf978eaaad 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -22,6 +22,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+static void gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
/*
* Pairing heap comparison function for the GISTSearchItem queue
@@ -71,7 +73,7 @@ pairingheap_GISTSearchItem_cmp(const pairingheap_node *a, const pairingheap_node
*/
IndexScanDesc
-gistbeginscan(Relation r, int nkeys, int norderbys)
+gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
GISTSTATE *giststate;
@@ -111,6 +113,31 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ /*
+ * XXX Maybe this should happen in RelationGetIndexScan? But we need to
+ * define the callbacks, so it needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = gist_prefetch_getblock;
+ prefetcher->get_range = gist_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
/*
@@ -356,3 +383,42 @@ gistendscan(IndexScanDesc scan)
*/
freeGISTstate(so->giststate);
}
+
+static void
+gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->didReset;
+ so->didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->curPageData;
+ *end = (so->nPageData - 1);
+ }
+ else
+ {
+ *start = 0;
+ *end = so->curPageData;
+ }
+}
+
+static BlockNumber
+gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->curPageData) || (index >= so->nPageData))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->pageData[index].heapPtr;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fc5d97f606e..01a25132bce 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -48,6 +48,9 @@ static void hashbuildCallback(Relation index,
bool tupleIsAlive,
void *state);
+static void _hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
/*
* Hash handler function: return IndexAmRoutine with access method parameters
@@ -362,7 +365,7 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
* hashbeginscan() -- start a scan on a hash index
*/
IndexScanDesc
-hashbeginscan(Relation rel, int nkeys, int norderbys)
+hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
HashScanOpaque so;
@@ -383,6 +386,31 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL;
so->numKilled = 0;
+ /*
+ * XXX Maybe this should happen in RelationGetIndexScan? But we need to
+ * define the callbacks, so it needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = _hash_prefetch_getblock;
+ prefetcher->get_range = _hash_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
return scan;
@@ -918,3 +946,42 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
else
LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
}
+
+static void
+_hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->currPos.didReset;
+ so->currPos.didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->currPos.itemIndex;
+ *end = so->currPos.lastItem;
+ }
+ else
+ {
+ *start = so->currPos.firstItem;
+ *end = so->currPos.itemIndex;
+ }
+}
+
+static BlockNumber
+_hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->currPos.items[index].heapTid;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9ea2a42a07f..b5cea5e23eb 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -434,6 +434,8 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
currItem = &so->currPos.items[so->currPos.itemIndex];
scan->xs_heaptid = currItem->heapTid;
+ index_prefetch(scan, dir);
+
/* if we're here, _hash_readpage found a valid tuples */
return true;
}
@@ -467,6 +469,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
so->currPos.buf = buf;
so->currPos.currPage = BufferGetBlockNumber(buf);
+ so->currPos.didReset = true;
if (ScanDirectionIsForward(dir))
{
@@ -597,6 +600,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
Assert(so->currPos.firstItem <= so->currPos.lastItem);
+
return true;
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 646135cc21c..b2f4eadc1ea 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
@@ -756,6 +757,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
PROGRESS_CLUSTER_INDEX_RELID
};
int64 ci_val[2];
+ int prefetch_target;
+
+ prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
/* Set phase and OIDOldIndex to columns */
ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -764,7 +768,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+ prefetch_target, prefetch_target);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aebab..264ebe1d8e5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ /* set in each AM when applicable */
+ scan->xs_prefetch = NULL;
+
return scan;
}
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, irel,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, indexRelation,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7abc..aa8a14624d8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -59,6 +59,7 @@
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -106,7 +107,8 @@ do { \
static IndexScanDesc index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap);
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset);
/* ----------------------------------------------------------------
@@ -200,18 +202,36 @@ index_insert(Relation indexRelation,
* index_beginscan - start a scan of an index with amgettuple
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * needed). Secondly, we must prevent an infinite loop when determining the
+ * prefetch value for the tablespace - get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in such a loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply anymore, now that the
+ * effective_io_concurrency lookup moved to the caller of index_beginscan.
*/
IndexScanDesc
index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys)
+ int nkeys, int norderbys,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+ prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -241,7 +261,8 @@ index_beginscan_bitmap(Relation indexRelation,
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+ 0, 0); /* no prefetch */
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -258,7 +279,8 @@ index_beginscan_bitmap(Relation indexRelation,
static IndexScanDesc
index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap)
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
@@ -276,8 +298,8 @@ index_beginscan_internal(Relation indexRelation,
/*
* Tell the AM to open a scan.
*/
- scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
- norderbys);
+ scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys,
+ prefetch_target, prefetch_reset);
/* Initialize information for parallel scan. */
scan->parallel_scan = pscan;
scan->xs_temp_snap = temp_snap;
@@ -317,6 +339,16 @@ index_rescan(IndexScanDesc scan,
scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
orderbys, norderbys);
+
+ /* If we're prefetching for this index, maybe reset some of the state. */
+ if (scan->xs_prefetch != NULL)
+ {
+ IndexPrefetch prefetcher = scan->xs_prefetch;
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+ prefetcher->prefetchReset);
+ }
}
/* ----------------
@@ -487,10 +519,13 @@ index_parallelrescan(IndexScanDesc scan)
* index_beginscan_parallel - join parallel index scan
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
*/
IndexScanDesc
index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
- int norderbys, ParallelIndexScanDesc pscan)
+ int norderbys, ParallelIndexScanDesc pscan,
+ int prefetch_target, int prefetch_reset)
{
Snapshot snapshot;
IndexScanDesc scan;
@@ -499,7 +534,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
RegisterSnapshot(snapshot);
scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
- pscan, true);
+ pscan, true, prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -557,6 +592,9 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
pgstat_count_index_tuples(scan->indexRelation, 1);
+ /* do index prefetching, if needed */
+ index_prefetch(scan, direction);
+
/* Return the TID of the tuple we found. */
return &scan->xs_heaptid;
}
@@ -988,3 +1026,228 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+
+
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such a filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only a few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+void
+index_prefetch(IndexScanDesc scan, ScanDirection dir)
+{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ /*
+ * No heap relation means bitmap index scan, which does prefetching at
+ * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+ * without the heap)
+ *
+ * XXX But in this case we should have prefetchMaxTarget=0, because in
+ * index_beginscan_bitmap() we disable prefetching. So maybe we should
+ * just check that.
+ */
+ if (!prefetch)
+ return;
+
+ /* was it initialized correctly? */
+ // Assert(prefetch->prefetchIndex != -1);
+
+ /*
+ * If we got here, prefetching is enabled and it's a node that supports
+ * prefetching (i.e. it can't be a bitmap index scan).
+ */
+ Assert(scan->heapRelation);
+
+ /* gradually increase the prefetch distance */
+ prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+ prefetch->prefetchMaxTarget);
+
+ /*
+ * Did we already reach the point to actually start prefetching? If not,
+ * we're done. We'll try again for the next index tuple.
+ */
+ if (prefetch->prefetchTarget <= 0)
+ return;
+
+ /*
+ * XXX I think we don't need to worry about direction here, that's handled
+ * by how the AMs build the curPos etc. (see nbtsearch.c)
+ */
+ if (ScanDirectionIsForward(dir))
+ {
+ bool reset;
+ int startIndex,
+ endIndex;
+
+ /* get indexes of unprocessed index entries */
+ prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+ /*
+ * Did we switch to a different index block? if yes, reset relevant
+ * info so that we start prefetching from scratch.
+ */
+ if (reset)
+ {
+ prefetch->prefetchTarget = prefetch->prefetchReset;
+ prefetch->prefetchIndex = startIndex; /* maybe -1 instead? */
+ pgBufferUsage.blks_prefetch_rounds++;
+ }
+
+ /*
+ * Adjust the range, based on what we already prefetched, and also
+ * based on the prefetch target.
+ *
+ * XXX We need to adjust the end index first, because it depends on
+ * the actual position, before we consider how far we prefetched.
+ */
+ endIndex = Min(endIndex, startIndex + prefetch->prefetchTarget);
+ startIndex = Max(startIndex, prefetch->prefetchIndex + 1);
+
+ for (int i = startIndex; i <= endIndex; i++)
+ {
+ bool recently_prefetched = false;
+ BlockNumber block;
+
+ block = prefetch->get_block(scan, dir, i);
+
+ /*
+ * Do not prefetch the same block over and over again.
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ *
+ * XXX We can't just check blocks between startIndex and endIndex,
+ * because at some point (after the prefetch target gets ramped up)
+ * it's going to be just a single block.
+ *
+ * XXX The solution here is pretty trivial - we keep a tiny array of
+ * the 8 most recently prefetched blocks and search it linearly. We
+ * could keep a longer history, or maintain a proper "already
+ * prefetched" structure (e.g. a small LRU keyed by block number).
+ */
+ for (int j = 0; j < 8; j++)
+ {
+ /* the cached block might be InvalidBlockNumber, but that's fine */
+ if (prefetch->cacheBlocks[j] == block)
+ {
+ recently_prefetched = true;
+ break;
+ }
+ }
+
+ if (recently_prefetched)
+ continue;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+
+ prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+ prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+ }
+
+ prefetch->prefetchIndex = endIndex;
+ }
+ else
+ {
+ bool reset;
+ int startIndex,
+ endIndex;
+
+ /* get indexes of unprocessed index entries */
+ prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+ /* FIXME handle the reset flag */
+
+ /*
+ * Adjust the range, based on what we already prefetched, and also
+ * based on the prefetch target.
+ *
+ * XXX We need to adjust the start index first, because it depends on
+ * the actual position, before we consider how far we prefetched (which
+ * for backwards scans is the end index).
+ */
+ startIndex = Max(startIndex, endIndex - prefetch->prefetchTarget);
+ endIndex = Min(endIndex, prefetch->prefetchIndex - 1);
+
+ for (int i = endIndex; i >= startIndex; i--)
+ {
+ bool recently_prefetched = false;
+ BlockNumber block;
+
+ block = prefetch->get_block(scan, dir, i);
+
+ /*
+ * Do not prefetch the same block over and over again.
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ *
+ * XXX We can't just check blocks between startIndex and endIndex,
+ * because at some point (after the prefetch target gets ramped up)
+ * it's going to be just a single block.
+ *
+ * XXX The solution here is pretty trivial - we keep a tiny array of
+ * the 8 most recently prefetched blocks and search it linearly. We
+ * could keep a longer history, or maintain a proper "already
+ * prefetched" structure (e.g. a small LRU keyed by block number).
+ */
+ for (int j = 0; j < 8; j++)
+ {
+ /* the cached block might be InvalidBlockNumber, but that's fine */
+ if (prefetch->cacheBlocks[j] == block)
+ {
+ recently_prefetched = true;
+ break;
+ }
+ }
+
+ if (recently_prefetched)
+ continue;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+
+ prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+ prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+ }
+
+ prefetch->prefetchIndex = startIndex;
+ }
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ce5b15199a..b1a02cc9bcd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -37,6 +37,7 @@
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
+#include "utils/spccache.h"
/*
@@ -87,6 +88,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+static void _bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -341,7 +344,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
* btbeginscan() -- start a scan on a btree index
*/
IndexScanDesc
-btbeginscan(Relation rel, int nkeys, int norderbys)
+btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
BTScanOpaque so;
@@ -369,6 +372,31 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ /*
+ * XXX Maybe this should happen in RelationGetIndexScan? But we need to
+ * define the callbacks, so it needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = _bt_prefetch_getblock;
+ prefetcher->get_range = _bt_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -1423,3 +1451,42 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+static void
+_bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->currPos.didReset;
+ so->currPos.didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->currPos.itemIndex;
+ *end = so->currPos.lastItem;
+ }
+ else
+ {
+ *start = so->currPos.firstItem;
+ *end = so->currPos.itemIndex;
+ }
+}
+
+static BlockNumber
+_bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->currPos.items[index].heapTid;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce95..762d95d09ed 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -47,7 +47,6 @@ static Buffer _bt_walk_left(Relation rel, Relation heaprel, Buffer buf,
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
/*
* _bt_drop_lock_and_maybe_pin()
*
@@ -1385,7 +1384,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
_bt_parallel_done(scan);
BTScanPosInvalidate(so->currPos);
-
return false;
}
else
@@ -1538,6 +1536,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
Assert(BufferIsValid(so->currPos.buf));
+ /*
+ * Mark the currPos as reset before loading the next chunk of item
+ * pointers, to restart the prefetching.
+ */
+ so->currPos.didReset = true;
+
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index cbfaf0c00ac..79015194b73 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/genam.h"
+#include "access/relation.h"
#include "access/relscan.h"
#include "access/spgist_private.h"
#include "miscadmin.h"
@@ -32,6 +33,10 @@ typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
SpGistLeafTuple leafTuple, bool recheck,
bool recheckDistances, double *distances);
+static void spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
+
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
* KNN-searches currently only support NULLS LAST. So, preserve this logic
@@ -191,6 +196,7 @@ resetSpGistScanOpaque(SpGistScanOpaque so)
pfree(so->reconTups[i]);
}
so->iPtr = so->nPtrs = 0;
+ so->didReset = true;
}
/*
@@ -301,7 +307,7 @@ spgPrepareScanKeys(IndexScanDesc scan)
}
IndexScanDesc
-spgbeginscan(Relation rel, int keysz, int orderbysz)
+spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
SpGistScanOpaque so;
@@ -316,6 +322,8 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
so->keyData = NULL;
initSpGistState(&so->state, scan->indexRelation);
+ so->state.heap = relation_open(scan->indexRelation->rd_index->indrelid, NoLock);
+
so->tempCxt = AllocSetContextCreate(CurrentMemoryContext,
"SP-GiST search temporary context",
ALLOCSET_DEFAULT_SIZES);
@@ -371,6 +379,31 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
so->indexCollation = rel->rd_indcollation[0];
+ /*
+ * XXX Maybe this should happen in RelationGetIndexScan? But we need to
+ * define the callbacks, so it needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = spgist_prefetch_getblock;
+ prefetcher->get_range = spgist_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
return scan;
@@ -453,6 +486,8 @@ spgendscan(IndexScanDesc scan)
pfree(scan->xs_orderbynulls);
}
+ relation_close(so->state.heap, NoLock);
+
pfree(so);
}
@@ -584,6 +619,13 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
isnull,
distances);
+ // FIXME prefetch here? or in storeGettuple?
+ {
+ BlockNumber block = ItemPointerGetBlockNumber(&leafTuple->heapPtr);
+
+ PrefetchBuffer(so->state.heap, MAIN_FORKNUM, block);
+ }
+
spgAddSearchItemToQueue(so, heapItem);
MemoryContextSwitchTo(oldCxt);
@@ -1047,7 +1089,12 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
index_store_float8_orderby_distances(scan, so->orderByTypes,
so->distances[so->iPtr],
so->recheckDistances[so->iPtr]);
+
so->iPtr++;
+
+ /* prefetch additional tuples */
+ index_prefetch(scan, dir);
+
return true;
}
@@ -1070,6 +1117,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
pfree(so->reconTups[i]);
}
so->iPtr = so->nPtrs = 0;
+ so->didReset = true;
spgWalk(scan->indexRelation, so, false, storeGettuple,
scan->xs_snapshot);
@@ -1095,3 +1143,42 @@ spgcanreturn(Relation index, int attno)
return cache->config.canReturnData;
}
+
+static void
+spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->didReset;
+ so->didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->iPtr;
+ *end = (so->nPtrs - 1);
+ }
+ else
+ {
+ *start = 0;
+ *end = so->iPtr;
+ }
+}
+
+static BlockNumber
+spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->iPtr) || (index >= so->nPtrs))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->heapPtrs[index];
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 190e4f76a9e..4aac68f0766 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -17,6 +17,7 @@
#include "access/amvalidate.h"
#include "access/htup_details.h"
+#include "access/relation.h"
#include "access/reloptions.h"
#include "access/spgist_private.h"
#include "access/toast_compression.h"
@@ -334,6 +335,9 @@ initSpGistState(SpGistState *state, Relation index)
state->index = index;
+ /* we'll initialize the reference in spgbeginscan */
+ state->heap = NULL;
+
/* Get cached static information about index */
cache = spgGetCache(index);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 15f9bddcdf3..0e41ffa8fc0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
!INSTR_TIME_IS_ZERO(usage->blk_write_time));
bool has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+ bool has_prefetches = (usage->blks_prefetches > 0);
bool show_planning = (planning && (has_shared ||
has_local || has_temp || has_timing ||
has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
appendStringInfoChar(es->str, '\n');
}
+ /* As above, show only positive counter values. */
+ if (has_prefetches)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Prefetches:");
+
+ if (usage->blks_prefetches > 0)
+ appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+ if (usage->blks_prefetch_rounds > 0)
+ appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+
if (show_planning)
es->indent--;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b897..e5ce1dbc953 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* May have to restart scan from this point if a potential conflict is
* found.
+ *
+ * XXX Should this do index prefetch? Probably not worth it for unique
+ * constraints, I guess? Otherwise we should calculate prefetch_target
+ * just like in nodeIndexscan etc.
*/
retry:
conflict = false;
found_self = false;
- index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+ index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9dd71684615..a997aac828f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -157,8 +157,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+ /* Start an index scan.
+ *
+ * XXX Should this do index prefetching? We're looking for a single tuple,
+ * probably using a PK / UNIQUE index, so it does not seem worth it. If we
+ * reconsider this, calculate prefetch_target like in nodeIndexscan.
+ */
+ scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
retry:
found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d2..434be59fca0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
dst->local_blks_written += add->local_blks_written;
dst->temp_blks_read += add->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+ dst->blks_prefetches += add->blks_prefetches;
INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b9699..3ecb8470d47 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We reach here if the index only scan is not parallel, or if we're
* serially executing an index only scan that was planned to be
* parallel.
+ *
+ * XXX Maybe we should enable prefetching, but prefetch only pages that
+ * are not all-visible (but checking that from the index code seems like
+ * a violation of layering etc).
+ *
+ * XXX This might lead to IOS being slower than plain index scan, if the
+ * table has a lot of pages that need recheck.
*/
scandesc = index_beginscan(node->ss.ss_currentRelation,
node->ioss_RelationDesc,
estate->es_snapshot,
node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ node->ioss_NumOrderByKeys,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc = scandesc;
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -719,7 +728,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d2..71ae6a47ce5 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
/*
* When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ Relation heapRel = node->ss.ss_currentRelation;
/*
* extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
if (scandesc == NULL)
{
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
if (scandesc == NULL)
{
+ Relation heapRel = node->ss.ss_currentRelation;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -1678,6 +1717,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Maybe reduce the value with parallel workers?
+ */
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1690,7 +1744,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1726,6 +1782,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->iss_ScanDesc =
@@ -1733,7 +1797,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..0b02b6265d0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
index_scan = index_beginscan(heapRel, indexRel,
&SnapshotNonVacuumable,
- 1, 0);
+ 1, 0, 0, 0); /* XXX maybe do prefetch? */
/* Set it up for index-only scan */
index_scan->xs_want_itup = true;
index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 4476ff7fba1..80fec7a11f9 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -160,7 +160,9 @@ typedef void (*amadjustmembers_function) (Oid opfamilyoid,
/* prepare for index scan */
typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
int nkeys,
- int norderbys);
+ int norderbys,
+ int prefetch_maximum,
+ int prefetch_reset);
/* (re)start index scan */
typedef void (*amrescan_function) (IndexScanDesc scan,
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 97ddc925b27..f17dcdffd86 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -96,7 +96,7 @@ extern bool brininsert(Relation idxRel, Datum *values, bool *nulls,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern int64 bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void brinrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a3087956654..6a500c5aa1f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -152,7 +152,9 @@ extern bool index_insert(Relation indexRelation,
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys);
+ int nkeys, int norderbys,
+ int prefetch_target,
+ int prefetch_reset);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
@@ -169,7 +171,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
extern void index_parallelrescan(IndexScanDesc scan);
extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
Relation indexrel, int nkeys, int norderbys,
- ParallelIndexScanDesc pscan);
+ ParallelIndexScanDesc pscan,
+ int prefetch_target,
+ int prefetch_reset);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
struct TupleTableSlot;
@@ -230,4 +234,45 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
ScanDirection direction);
extern void systable_endscan_ordered(SysScanDesc sysscan);
+
+
+extern void index_prefetch(IndexScanDesc scandesc, ScanDirection direction);
+
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int *start, int *end,
+ bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int index);
+
+typedef struct IndexPrefetchData
+{
+ /*
+ * XXX We need to disable this in some cases (e.g. when using index-only
+ * scans, we don't want to prefetch pages). Or maybe we should prefetch
+ * only pages that are not all-visible, that'd be even better.
+ */
+ int prefetchIndex; /* how far we already prefetched */
+ int prefetchTarget; /* how far we should be prefetching */
+ int prefetchMaxTarget; /* maximum prefetching distance */
+ int prefetchReset; /* reset to this distance on rescan */
+
+ /*
+ * a small LRU cache of recently prefetched blocks
+ *
+ * XXX needs to be tiny, to make the (frequent) searches very cheap
+ */
+ BlockNumber cacheBlocks[8];
+ int cacheIndex;
+
+ prefetcher_getblock_function get_block;
+ prefetcher_getrange_function get_range;
+
+} IndexPrefetchData;
+
#endif /* GENAM_H */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 6da64928b66..b4bd3b2e202 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -384,7 +384,7 @@ typedef struct GinScanOpaqueData
typedef GinScanOpaqueData *GinScanOpaque;
-extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void ginendscan(IndexScanDesc scan);
extern void ginrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 3edc740a3f3..e844a9eed84 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -176,6 +176,7 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+ bool didReset; /* reset since last access? */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
diff --git a/src/include/access/gistscan.h b/src/include/access/gistscan.h
index 65911245f74..adf167a60b6 100644
--- a/src/include/access/gistscan.h
+++ b/src/include/access/gistscan.h
@@ -16,7 +16,7 @@
#include "access/amapi.h"
-extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
ScanKey orderbys, int norderbys);
extern void gistendscan(IndexScanDesc scan);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9e035270a16..743192997c5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -124,6 +124,8 @@ typedef struct HashScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
+ bool didReset;
+
HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
} HashScanPosData;
@@ -370,7 +372,7 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
struct IndexInfo *indexInfo);
extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
-extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
extern void hashendscan(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d6847860959..8d053de461b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -984,6 +984,9 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
+ /* Was the position reset/rebuilt since the last time we checked it? */
+ bool didReset;
+
BTScanPosItem items[MaxTIDsPerBTreePage]; /* MUST BE LAST */
} BTScanPosData;
@@ -1019,6 +1022,7 @@ typedef BTScanPosData *BTScanPos;
(scanpos).buf = InvalidBuffer; \
(scanpos).lsn = InvalidXLogRecPtr; \
(scanpos).nextTupleOffset = 0; \
+ (scanpos).didReset = true; \
} while (0)
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
@@ -1127,7 +1131,7 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac04..c119fe597d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
Relation rel;
} IndexFetchTableData;
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
bool *xs_orderbynulls;
bool xs_recheckorderby;
+ /* prefetching state (or NULL if disabled) */
+ IndexPrefetchData *xs_prefetch;
+
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
} IndexScanDescData;
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index fe31d32dbe9..e1e2635597c 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -203,7 +203,7 @@ extern bool spginsert(Relation index, Datum *values, bool *isnull,
struct IndexInfo *indexInfo);
/* spgscan.c */
-extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz);
+extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset);
extern void spgendscan(IndexScanDesc scan);
extern void spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index c6ef46fc206..e00d4fc90b6 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -144,7 +144,7 @@ typedef struct SpGistTypeDesc
typedef struct SpGistState
{
Relation index; /* index we're working with */
-
+ Relation heap; /* heap the index is defined on */
spgConfigOut config; /* filled in by opclass config method */
SpGistTypeDesc attType; /* type of values to be indexed/restored */
@@ -231,6 +231,7 @@ typedef struct SpGistScanOpaqueData
bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck
* flags */
HeapTuple reconTups[MaxIndexTuplesPerPage]; /* reconstructed tuples */
+ bool didReset; /* reset since last access? */
/* distances (for recheck) */
IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183bd..97dd3c2c421 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+ int64 blks_prefetch_rounds; /* # of prefetch rounds */
+ int64 blks_prefetches; /* # of buffers prefetched */
instr_time blk_read_time; /* time spent reading blocks */
instr_time blk_write_time; /* time spent writing blocks */
instr_time temp_blk_read_time; /* time spent reading temp blocks */
Hi,
attached is a v4 of the patch, with a fairly major shift in the approach.
Until now the patch very much relied on the AM to provide information about
which blocks to prefetch next (based on the current leaf index page).
This seemed like a natural approach when I started working on the PoC,
but over time I ran into various drawbacks:
* a lot of the logic is at the AM level
* can't prefetch across the index page boundary (have to wait until the
next index leaf page is read by the indexscan)
* doesn't work for distance searches (gist/spgist).
After thinking about this, I decided to ditch this whole idea of
exchanging prefetch information through an API, and to do the prefetching
almost entirely in the indexam code.
The new patch maintains a queue of TIDs (read from index_getnext_tid),
with up to effective_io_concurrency entries - calling getnext_slot()
adds a TID at the queue tail, issues a prefetch for the block, and then
returns the TID from the queue head.
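To illustrate the idea, here's a rough sketch of the queue logic (simplified
for illustration only - the TidQueue struct, tid_queue_next() and the fixed
QUEUE_SIZE are made up here, the actual logic lives in index_getnext_slot()
and index_prefetch() in indexam.c):

#include "postgres.h"
#include "access/genam.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"

#define QUEUE_SIZE	32			/* stand-in for effective_io_concurrency */

typedef struct TidQueue
{
	ItemPointerData items[QUEUE_SIZE];
	uint64		head;			/* next TID to hand back to the caller */
	uint64		tail;			/* next free slot in the ring buffer */
	bool		done;			/* index has no more TIDs */
} TidQueue;

/*
 * Refill the queue from the index, prefetch the heap block of each TID as
 * it is added, then return the TID at the queue head (if any).
 */
static bool
tid_queue_next(IndexScanDesc scan, ScanDirection dir, TidQueue *q,
			   ItemPointerData *result)
{
	while (!q->done && (q->tail - q->head) < QUEUE_SIZE)
	{
		ItemPointer tid = index_getnext_tid(scan, dir);

		if (tid == NULL)
		{
			q->done = true;
			break;
		}

		q->items[q->tail++ % QUEUE_SIZE] = *tid;

		/* async hint for the heap page this TID points to */
		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
					   ItemPointerGetBlockNumber(tid));
	}

	if (q->head == q->tail)
		return false;			/* queue empty and index exhausted */

	*result = q->items[q->head++ % QUEUE_SIZE];
	return true;
}

(The real code also deduplicates blocks and detects sequential patterns
before calling PrefetchBuffer, as described below.)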
Maintaining the queue is up to index_getnext_slot() - it can't be done
in index_getnext_tid(), because then it'd affect IOS (and prefetching the
heap would mostly defeat the whole point of IOS). And we can't do that
above index_getnext_slot() because that already fetched the heap page.
I still think prefetching for IOS is doable (and desirable), in mostly
the same way - except that we'd need to maintain the queue from some
other place, as IOS doesn't do index_getnext_slot().
FWIW there's also the "index-only filters without IOS" patch [1] which
switches even regular index scans to index_getnext_tid(), so maybe
relying on index_getnext_slot() is a lost cause anyway.
Anyway, this has the nice consequence that it makes AM code entirely
oblivious of prefetching - there's no need for an API, we just get TIDs as
before, and the prefetching magic happens after that. Thus it also works
for searches ordered by distance (gist/spgist). The patch got much
smaller (about 40kB, down from 80kB), which is nice.
I ran the benchmarks [2] with this v4 patch, and the results for the
"point" queries are almost exactly the same as for v3. The SAOP part is
still running - I'll add those results in a day or two, but I expect
a similar outcome as for the point queries.
regards
[1]: https://commitfest.postgresql.org/43/4352/
[2]: https://github.com/tvondra/index-prefetch-tests-2/
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: index-prefetch-v4.patch (text/x-patch)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069c..9045c6eb7aa 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -678,7 +678,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
scan->xs_hitup = so->pageData[so->curPageData].recontup;
so->curPageData++;
-
return true;
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0755be83901..f0412da94ae 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
@@ -751,6 +752,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
PROGRESS_CLUSTER_INDEX_RELID
};
int64 ci_val[2];
+ int prefetch_target;
+
+ prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
/* Set phase and OIDOldIndex to columns */
ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -759,7 +763,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+ prefetch_target, prefetch_target);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aebab..264ebe1d8e5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ /* set in each AM when applicable */
+ scan->xs_prefetch = NULL;
+
return scan;
}
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, irel,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, indexRelation,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7abc..3722874948f 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -54,11 +54,13 @@
#include "catalog/pg_amproc.h"
#include "catalog/pg_type.h"
#include "commands/defrem.h"
+#include "common/hashfn.h"
#include "nodes/makefuncs.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -106,7 +108,10 @@ do { \
static IndexScanDesc index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap);
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset);
+
+static void index_prefetch(IndexScanDesc scan, ItemPointer tid);
/* ----------------------------------------------------------------
@@ -200,18 +205,36 @@ index_insert(Relation indexRelation,
* index_beginscan - start a scan of an index with amgettuple
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * really needed). Secondly, we must prevent an infinite loop when determining
+ * prefetch value for the tablespace - the get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in an infinite loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply after effective_io_concurrency
+ * lookup moved to caller of index_beginscan.
*/
IndexScanDesc
index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys)
+ int nkeys, int norderbys,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+ prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -241,7 +264,8 @@ index_beginscan_bitmap(Relation indexRelation,
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+ 0, 0); /* no prefetch */
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -258,7 +282,8 @@ index_beginscan_bitmap(Relation indexRelation,
static IndexScanDesc
index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap)
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
@@ -276,12 +301,27 @@ index_beginscan_internal(Relation indexRelation,
/*
* Tell the AM to open a scan.
*/
- scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
- norderbys);
+ scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys);
/* Initialize information for parallel scan. */
scan->parallel_scan = pscan;
scan->xs_temp_snap = temp_snap;
+ /* with prefetching enabled, initialize the necessary state */
+ if (prefetch_target > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->queueIndex = 0;
+ prefetcher->queueStart = 0;
+ prefetcher->queueEnd = 0;
+
+ prefetcher->prefetchTarget = 0;
+ prefetcher->prefetchMaxTarget = prefetch_target;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
return scan;
}
@@ -317,6 +357,20 @@ index_rescan(IndexScanDesc scan,
scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
orderbys, norderbys);
+
+ /* If we're prefetching for this index, maybe reset some of the state. */
+ if (scan->xs_prefetch != NULL)
+ {
+ IndexPrefetch prefetcher = scan->xs_prefetch;
+
+ prefetcher->queueStart = 0;
+ prefetcher->queueEnd = 0;
+ prefetcher->queueIndex = 0;
+ prefetcher->prefetchDone = false;
+
+ prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+ prefetcher->prefetchReset);
+ }
}
/* ----------------
@@ -345,6 +399,17 @@ index_endscan(IndexScanDesc scan)
if (scan->xs_temp_snap)
UnregisterSnapshot(scan->xs_snapshot);
+ /* If prefetching enabled, log prefetch stats. */
+ if (scan->xs_prefetch)
+ {
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ elog(LOG, "index prefetch stats: requests %lu prefetches %lu (%f) skip cached %lu sequential %lu",
+ prefetch->countAll, prefetch->countPrefetch,
+ prefetch->countPrefetch * 100.0 / prefetch->countAll,
+ prefetch->countSkipCached, prefetch->countSkipSequential);
+ }
+
/* Release the scan data structure itself */
IndexScanEnd(scan);
}
@@ -487,10 +552,13 @@ index_parallelrescan(IndexScanDesc scan)
* index_beginscan_parallel - join parallel index scan
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
*/
IndexScanDesc
index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
- int norderbys, ParallelIndexScanDesc pscan)
+ int norderbys, ParallelIndexScanDesc pscan,
+ int prefetch_target, int prefetch_reset)
{
Snapshot snapshot;
IndexScanDesc scan;
@@ -499,7 +567,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
RegisterSnapshot(snapshot);
scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
- pscan, true);
+ pscan, true, prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -623,20 +691,74 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
bool
index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
for (;;)
{
+ /* with prefetching enabled, accumulate enough TIDs into the prefetch queue */
+ if (PREFETCH_ACTIVE(prefetch))
+ {
+ /*
+ * incrementally ramp up prefetch distance
+ *
+ * XXX Intentionally done first, so that with prefetching there's
+ * always at least one item in the queue.
+ */
+ prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+ prefetch->prefetchMaxTarget);
+
+ /*
+ * get more TIDs while there is empty space in the queue (considering the
+ * current prefetch target)
+ */
+ while (!PREFETCH_FULL(prefetch))
+ {
+ ItemPointer tid;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scan, direction);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ prefetch->prefetchDone = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+ prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)] = *tid;
+ prefetch->queueEnd++;
+
+ index_prefetch(scan, tid);
+ }
+ }
+
if (!scan->xs_heap_continue)
{
- ItemPointer tid;
+ if (PREFETCH_ENABLED(prefetch))
+ {
+ /* prefetching enabled, but reached the end and queue empty */
+ if (PREFETCH_DONE(prefetch))
+ break;
+
+ scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)];
+ prefetch->queueIndex++;
+ }
+ else /* not prefetching, just do the regular work */
+ {
+ ItemPointer tid;
- /* Time to fetch the next TID from the index */
- tid = index_getnext_tid(scan, direction);
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scan, direction);
- /* If we're out of index entries, we're done */
- if (tid == NULL)
- break;
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ break;
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+ }
- Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
}
/*
@@ -988,3 +1110,258 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+
+/*
+ * Add the block to the tiny top-level queue (LRU), and check if the block
+ * is in a sequential pattern.
+ */
+static bool
+index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
+{
+ int idx;
+
+ /* If the queue is empty, just store the block and we're done. */
+ if (prefetch->blockIndex == 0)
+ {
+ prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+ prefetch->blockIndex++;
+ return false;
+ }
+
+ /*
+ * Otherwise, check if it's the same as the immediately preceding block (we
+ * don't want to prefetch the same block over and over.)
+ */
+ if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
+ return true;
+
+ /* Not the same block, so add it to the queue. */
+ prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+ prefetch->blockIndex++;
+
+ /* check for a sequential pattern a couple of requests back */
+ for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
+ {
+ /* not enough requests to confirm a sequential pattern */
+ if (prefetch->blockIndex < i)
+ return false;
+
+ /*
+ * index of the already requested buffer (-1 because we already
+ * incremented the index when adding the block to the queue)
+ */
+ idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
+
+ /* the blocks don't form a sequential pattern */
+ if (prefetch->blockItems[idx] != (block - i))
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * index_prefetch_add_cache
+ * Add a block to the cache, return true if it was recently prefetched.
+ *
+ * When checking a block, we need to check if it was recently prefetched,
+ * where recently means within PREFETCH_CACHE_SIZE requests. This check
+ * needs to be very cheap, even with fairly large caches (hundreds of
+ * entries). The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also
+ * need to expire entries, so that only "recent" requests are remembered.
+ *
+ * A queue would allow expiring the requests, but checking if a block was
+ * prefetched would be expensive (linear search for longer queues). Another
+ * option would be a hash table, but that has issues with expiring entries
+ * cheaply (which usually degrades the hash table).
+ *
+ * So we use a cache that is organized as multiple small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table), and each LRU is tiny (e.g. 8 entries). The LRU only keeps
+ * the most recent requests (for that particular LRU).
+ *
+ * This allows quick searches and expiration, with false negatives (when
+ * a particular LRU has too many collisions).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * prefetch requests in total.
+ *
+ * The recency is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint64, so it should
+ * not wrap (125 zebibytes, would take ~4 million years at 1GB/s).
+ *
+ * To check if a block was prefetched recently, we calculate hash(block),
+ * and then linearly search if the tiny LRU has an entry for the same block,
+ * requested less than PREFETCH_CACHE_SIZE requests ago.
+ *
+ * At the same time, we either update the entry (for the same block) if
+ * found, or replace the oldest/empty entry.
+ *
+ * If the block was not recently prefetched (i.e. we want to prefetch it),
+ * we increment the counter.
+ */
+static bool
+index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
+{
+ PrefetchCacheEntry *entry;
+
+ /* calculate which LRU to use */
+ int lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+
+ /* entry to (maybe) use for this block request */
+ uint64 oldestRequest = PG_UINT64_MAX;
+ int oldestIndex = -1;
+
+ /*
+ * First add the block to the (tiny) top-level LRU cache and see if it's
+ * part of a sequential pattern. In this case we just ignore the block
+ * and don't prefetch it - we expect read-ahead to do a better job.
+ *
+ * XXX Maybe we should still add the block to the later cache, in case
+ * we happen to access it later? That might help if we first scan a lot
+ * of the table sequentially, and then randomly. Not sure that's very
+ * likely with index access, though.
+ */
+ if (index_prefetch_is_sequential(prefetch, block))
+ {
+ prefetch->countSkipSequential++;
+ return true;
+ }
+
+ /* see if we already have prefetched this block (linear search of LRU) */
+ for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+ {
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+ /* Is this the oldest prefetch request in this LRU? */
+ if (entry->request < oldestRequest)
+ {
+ oldestRequest = entry->request;
+ oldestIndex = i;
+ }
+
+ /* Request numbers are positive, so 0 means "unused". */
+ if (entry->request == 0)
+ continue;
+
+ /* Is this entry for the same block as the current request? */
+ if (entry->block == block)
+ {
+ bool prefetched;
+
+ /*
+ * Is the old request sufficiently recent? If yes, we treat the
+ * block as already prefetched.
+ *
+ * XXX We do add the cache size to the request in order not to
+ * have issues with uint64 underflows.
+ */
+ prefetched = (entry->request + PREFETCH_CACHE_SIZE >= prefetch->prefetchReqNumber);
+
+ /* Update the request number. */
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ prefetch->countSkipCached += (prefetched) ? 1 : 0;
+
+ return prefetched;
+ }
+ }
+
+ /*
+ * We didn't find the block in the LRU, so store it either in an empty
+ * entry, or in the "oldest" prefetch request in this LRU.
+ */
+ Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
+
+ entry->block = block;
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ /* not in the prefetch cache */
+ return false;
+}
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not to use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+static void
+index_prefetch(IndexScanDesc scan, ItemPointer tid)
+{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+ BlockNumber block;
+
+ /*
+ * No heap relation means bitmap index scan, which does prefetching at
+ * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+ * without the heap).
+ *
+ * XXX But in this case we should have prefetchMaxTarget=0, because in
+ * index_beginscan_bitmap() we disable prefetching. So maybe we should
+ * just check that.
+ */
+ if (!prefetch)
+ return;
+
+ /* was it initialized correctly? */
+ // Assert(prefetch->prefetchIndex != -1);
+
+ /*
+ * If we got here, prefetching is enabled and it's a node that supports
+ * prefetching (i.e. it can't be a bitmap index scan).
+ */
+ Assert(scan->heapRelation);
+
+ prefetch->countAll++;
+
+ block = ItemPointerGetBlockNumber(tid);
+
+ /*
+ * Do not prefetch the same block over and over again.
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ */
+ if (!index_prefetch_add_cache(prefetch, block))
+ {
+ prefetch->countPrefetch++;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+ }
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 15f9bddcdf3..0e41ffa8fc0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
!INSTR_TIME_IS_ZERO(usage->blk_write_time));
bool has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+ bool has_prefetches = (usage->blks_prefetches > 0);
bool show_planning = (planning && (has_shared ||
has_local || has_temp || has_timing ||
has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
appendStringInfoChar(es->str, '\n');
}
+ /* As above, show only positive counter values. */
+ if (has_prefetches)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Prefetches:");
+
+ if (usage->blks_prefetches > 0)
+ appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+ if (usage->blks_prefetch_rounds > 0)
+ appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+
if (show_planning)
es->indent--;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b897..e5ce1dbc953 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* May have to restart scan from this point if a potential conflict is
* found.
+ *
+ * XXX Should this do index prefetch? Probably not worth it for unique
+ * constraints, I guess? Otherwise we should calculate prefetch_target
+ * just like in nodeIndexscan etc.
*/
retry:
conflict = false;
found_self = false;
- index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+ index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9dd71684615..a997aac828f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -157,8 +157,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+ /* Start an index scan.
+ *
+ * XXX Should this do index prefetching? We're looking for a single tuple,
+ * probably using a PK / UNIQUE index, so it does not seem worth it. If we
+ * reconsider this, calculate prefetch_target like in nodeIndexscan.
+ */
+ scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
retry:
found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d2..434be59fca0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
dst->local_blks_written += add->local_blks_written;
dst->temp_blks_read += add->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+ dst->blks_prefetches += add->blks_prefetches;
INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b9699..3ecb8470d47 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We reach here if the index only scan is not parallel, or if we're
* serially executing an index only scan that was planned to be
* parallel.
+ *
+ * XXX Maybe we should enable prefetching, but prefetch only pages that
+ * are not all-visible (but checking that from the index code seems like
+ * a violation of layering etc).
+ *
+ * XXX This might lead to IOS being slower than plain index scan, if the
+ * table has a lot of pages that need recheck.
*/
scandesc = index_beginscan(node->ss.ss_currentRelation,
node->ioss_RelationDesc,
estate->es_snapshot,
node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ node->ioss_NumOrderByKeys,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc = scandesc;
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -719,7 +728,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d2..71ae6a47ce5 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
/*
* When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ Relation heapRel = node->ss.ss_currentRelation;
/*
* extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
if (scandesc == NULL)
{
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
if (scandesc == NULL)
{
+ Relation heapRel = node->ss.ss_currentRelation;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -1678,6 +1717,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Maybe reduce the value with parallel workers?
+ */
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1690,7 +1744,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1726,6 +1782,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->iss_ScanDesc =
@@ -1733,7 +1797,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d3a136b6f55..c7248877f6c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1131,6 +1131,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
need_full_snapshot = true;
}
+ elog(LOG, "slot = %s need_full_snapshot = %d", cmd->slotname, need_full_snapshot);
+
ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
InvalidXLogRecPtr,
XL_ROUTINE(.page_read = logical_read_xlog_page,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..0b02b6265d0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
index_scan = index_beginscan(heapRel, indexRel,
&SnapshotNonVacuumable,
- 1, 0);
+ 1, 0, 0, 0); /* XXX maybe do prefetch? */
/* Set it up for index-only scan */
index_scan->xs_want_itup = true;
index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a3087956654..f3efffc4a84 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -17,6 +17,7 @@
#include "access/sdir.h"
#include "access/skey.h"
#include "nodes/tidbitmap.h"
+#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -152,7 +153,9 @@ extern bool index_insert(Relation indexRelation,
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys);
+ int nkeys, int norderbys,
+ int prefetch_target,
+ int prefetch_reset);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
@@ -169,7 +172,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
extern void index_parallelrescan(IndexScanDesc scan);
extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
Relation indexrel, int nkeys, int norderbys,
- ParallelIndexScanDesc pscan);
+ ParallelIndexScanDesc pscan,
+ int prefetch_target,
+ int prefetch_reset);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
struct TupleTableSlot;
@@ -230,4 +235,108 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
ScanDirection direction);
extern void systable_endscan_ordered(SysScanDesc sysscan);
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int *start, int *end,
+ bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int index);
+
+/*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches. Doesn't need to be perfectly accurate, but we
+ * aim to make false positives/negatives reasonably low.
+ */
+typedef struct PrefetchCacheEntry {
+ BlockNumber block;
+ uint64 request;
+} PrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - shouldn't be too
+ * small or too large. 1024 seems about right, it covers ~8MB of data.
+ * It's somewhat arbitrary, there's no particular formula saying it
+ * should not be higher/lower.
+ *
+ * The cache is structured as an array of small LRU caches, so the total
+ * size needs to be a multiple of LRU size. The LRU should be tiny to
+ * keep linear search cheap enough.
+ *
+ * XXX Maybe we could consider effective_cache_size or something?
+ */
+#define PREFETCH_LRU_SIZE 8
+#define PREFETCH_LRU_COUNT 128
+#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
+/*
+ * Used to detect sequential patterns (and disable prefetching).
+ */
+#define PREFETCH_QUEUE_HISTORY 8
+#define PREFETCH_SEQ_PATTERN_BLOCKS 4
+
+
+typedef struct IndexPrefetchData
+{
+ /*
+ * XXX We need to disable this in some cases (e.g. when using index-only
+ * scans, we don't want to prefetch pages). Or maybe we should prefetch
+ * only pages that are not all-visible, that'd be even better.
+ */
+ int prefetchTarget; /* how far we should be prefetching */
+ int prefetchMaxTarget; /* maximum prefetching distance */
+ int prefetchReset; /* reset to this distance on rescan */
+ bool prefetchDone; /* did we get all TIDs from the index? */
+
+ /* runtime statistics */
+ uint64 countAll; /* all prefetch requests */
+ uint64 countPrefetch; /* actual prefetches */
+ uint64 countSkipSequential;
+ uint64 countSkipCached;
+
+ /*
+ * Queue of TIDs to prefetch.
+ *
+ * XXX Sizing for MAX_IO_CONCURRENCY may be overkill, but it seems simpler
+ * than dynamically adjusting for custom values.
+ */
+ ItemPointerData queueItems[MAX_IO_CONCURRENCY];
+ uint64 queueIndex; /* next TID to prefetch */
+ uint64 queueStart; /* first valid TID in queue */
+ uint64 queueEnd; /* first invalid (empty) TID in queue */
+
+ /*
+ * A couple of the last prefetched blocks, used to check for certain access
+ * patterns and skip prefetching (e.g. for sequential access).
+ *
+ * XXX Separate from the main queue, because we only want to compare the
+ * block numbers, not the whole TID. In sequential access it's likely we
+ * read many items from each page, and we don't want to check many items
+ * (as that is much more expensive).
+ */
+ BlockNumber blockItems[PREFETCH_QUEUE_HISTORY];
+ uint64 blockIndex; /* index into blockItems (points to the
+ * first empty entry) */
+
+ /*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches.
+ */
+ uint64 prefetchReqNumber;
+ PrefetchCacheEntry prefetchCache[PREFETCH_CACHE_SIZE];
+
+} IndexPrefetchData;
+
+#define PREFETCH_QUEUE_INDEX(a) ((a) % (MAX_IO_CONCURRENCY))
+#define PREFETCH_QUEUE_EMPTY(p) ((p)->queueEnd == (p)->queueIndex)
+#define PREFETCH_ENABLED(p) ((p) && ((p)->prefetchMaxTarget > 0))
+#define PREFETCH_FULL(p) ((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
+#define PREFETCH_DONE(p) ((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+#define PREFETCH_ACTIVE(p) (PREFETCH_ENABLED(p) && !(p)->prefetchDone)
+#define PREFETCH_BLOCK_INDEX(v) ((v) % PREFETCH_QUEUE_HISTORY)
+
#endif /* GENAM_H */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac04..c119fe597d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
Relation rel;
} IndexFetchTableData;
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
bool *xs_orderbynulls;
bool xs_recheckorderby;
+ /* prefetching state (or NULL if disabled) */
+ IndexPrefetchData *xs_prefetch;
+
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
} IndexScanDescData;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183bd..97dd3c2c421 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+ int64 blks_prefetch_rounds; /* # of prefetch rounds */
+ int64 blks_prefetches; /* # of buffers prefetched */
instr_time blk_read_time; /* time spent reading blocks */
instr_time blk_write_time; /* time spent writing blocks */
instr_time temp_blk_read_time; /* time spent reading temp blocks */
Here's a v5 of the patch, rebased to current master and fixing a couple of
compiler warnings reported by cfbot (%lu vs. UINT64_FORMAT in some debug
messages). No other changes compared to v4.
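For reference, this is the kind of change involved (an illustrative fragment,
not the exact hunk - the counter here is just an example): uint64 values
can't be portably printed with %lu, so the format string is assembled with
the UINT64_FORMAT macro instead:

	uint64		count = 42;		/* example counter, not from the patch */

	/* not portable - uint64 may be "unsigned long long" on some platforms */
	elog(LOG, "index prefetch requests %lu", count);

	/* portable - UINT64_FORMAT expands to the right conversion specifier */
	elog(LOG, "index prefetch requests " UINT64_FORMAT, count);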
cfbot also reported a failure on Windows in pg_dump [1], but it seems
pretty strange:
[11:42:48.708] ------------------------------------- 8< -------------------------------------
[11:42:48.708] stderr:
[11:42:48.708] # Failed test 'connecting to an invalid database: matches'
The patch does nothing related to pg_dump, and the test works perfectly
fine for me (I don't have a Windows machine, but both 32-bit and 64-bit
Linux work fine for me).
regards
[1]: https://cirrus-ci.com/task/6398095366291456
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: index-prefetch-v5.patch (text/x-patch)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069..9045c6eb7a 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -678,7 +678,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
scan->xs_hitup = so->pageData[so->curPageData].recontup;
so->curPageData++;
-
return true;
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5a17112c91..0b6c920ebd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
@@ -751,6 +752,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
PROGRESS_CLUSTER_INDEX_RELID
};
int64 ci_val[2];
+ int prefetch_target;
+
+ prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
/* Set phase and OIDOldIndex to columns */
ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -759,7 +763,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+ prefetch_target, prefetch_target);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aeba..264ebe1d8e 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ /* set in each AM when applicable */
+ scan->xs_prefetch = NULL;
+
return scan;
}
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, irel,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, indexRelation,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7ab..0b8f136f04 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -54,11 +54,13 @@
#include "catalog/pg_amproc.h"
#include "catalog/pg_type.h"
#include "commands/defrem.h"
+#include "common/hashfn.h"
#include "nodes/makefuncs.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -106,7 +108,10 @@ do { \
static IndexScanDesc index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap);
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset);
+
+static void index_prefetch(IndexScanDesc scan, ItemPointer tid);
/* ----------------------------------------------------------------
@@ -200,18 +205,36 @@ index_insert(Relation indexRelation,
* index_beginscan - start a scan of an index with amgettuple
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * really needed). Secondly, we must prevent an infinite loop when determining
+ * prefetch value for the tablespace - the get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in an infinite loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply after effective_io_concurrency
+ * lookup moved to caller of index_beginscan.
*/
IndexScanDesc
index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys)
+ int nkeys, int norderbys,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+ prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -241,7 +264,8 @@ index_beginscan_bitmap(Relation indexRelation,
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+ 0, 0); /* no prefetch */
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -258,7 +282,8 @@ index_beginscan_bitmap(Relation indexRelation,
static IndexScanDesc
index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap)
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
@@ -276,12 +301,27 @@ index_beginscan_internal(Relation indexRelation,
/*
* Tell the AM to open a scan.
*/
- scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
- norderbys);
+ scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys);
/* Initialize information for parallel scan. */
scan->parallel_scan = pscan;
scan->xs_temp_snap = temp_snap;
+ /* with prefetching enabled, initialize the necessary state */
+ if (prefetch_target > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->queueIndex = 0;
+ prefetcher->queueStart = 0;
+ prefetcher->queueEnd = 0;
+
+ prefetcher->prefetchTarget = 0;
+ prefetcher->prefetchMaxTarget = prefetch_target;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
return scan;
}
@@ -317,6 +357,20 @@ index_rescan(IndexScanDesc scan,
scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
orderbys, norderbys);
+
+ /* If we're prefetching for this index, maybe reset some of the state. */
+ if (scan->xs_prefetch != NULL)
+ {
+ IndexPrefetch prefetcher = scan->xs_prefetch;
+
+ prefetcher->queueStart = 0;
+ prefetcher->queueEnd = 0;
+ prefetcher->queueIndex = 0;
+ prefetcher->prefetchDone = false;
+
+ prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+ prefetcher->prefetchReset);
+ }
}
/* ----------------
@@ -345,6 +399,19 @@ index_endscan(IndexScanDesc scan)
if (scan->xs_temp_snap)
UnregisterSnapshot(scan->xs_snapshot);
+ /* If prefetching enabled, log prefetch stats. */
+ if (scan->xs_prefetch)
+ {
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ elog(LOG, "index prefetch stats: requests " UINT64_FORMAT " prefetches " UINT64_FORMAT " (%f) skip cached " UINT64_FORMAT " sequential " UINT64_FORMAT,
+ prefetch->countAll,
+ prefetch->countPrefetch,
+ prefetch->countPrefetch * 100.0 / prefetch->countAll,
+ prefetch->countSkipCached,
+ prefetch->countSkipSequential);
+ }
+
/* Release the scan data structure itself */
IndexScanEnd(scan);
}
@@ -487,10 +554,13 @@ index_parallelrescan(IndexScanDesc scan)
* index_beginscan_parallel - join parallel index scan
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
*/
IndexScanDesc
index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
- int norderbys, ParallelIndexScanDesc pscan)
+ int norderbys, ParallelIndexScanDesc pscan,
+ int prefetch_target, int prefetch_reset)
{
Snapshot snapshot;
IndexScanDesc scan;
@@ -499,7 +569,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
RegisterSnapshot(snapshot);
scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
- pscan, true);
+ pscan, true, prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -623,20 +693,74 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
bool
index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
for (;;)
{
+ /* with prefetching enabled, accumulate enough TIDs into the prefetch queue */
+ if (PREFETCH_ACTIVE(prefetch))
+ {
+ /*
+ * incrementally ramp up prefetch distance
+ *
+ * XXX Intentionally done first, so that with prefetching there's
+ * always at least one item in the queue.
+ */
+ prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+ prefetch->prefetchMaxTarget);
+
+ /*
+ * get more TIDs while there is empty space in the queue (considering the
+ * current prefetch target)
+ */
+ while (!PREFETCH_FULL(prefetch))
+ {
+ ItemPointer tid;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scan, direction);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ prefetch->prefetchDone = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+ prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)] = *tid;
+ prefetch->queueEnd++;
+
+ index_prefetch(scan, tid);
+ }
+ }
+
if (!scan->xs_heap_continue)
{
- ItemPointer tid;
+ if (PREFETCH_ENABLED(prefetch))
+ {
+ /* prefetching enabled, but reached the end and queue empty */
+ if (PREFETCH_DONE(prefetch))
+ break;
+
+ scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)];
+ prefetch->queueIndex++;
+ }
+ else /* not prefetching, just do the regular work */
+ {
+ ItemPointer tid;
- /* Time to fetch the next TID from the index */
- tid = index_getnext_tid(scan, direction);
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scan, direction);
- /* If we're out of index entries, we're done */
- if (tid == NULL)
- break;
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ break;
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+ }
- Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
}
/*
@@ -988,3 +1112,258 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+
+/*
+ * Add the block to the tiny top-level queue (LRU), and check if the block
+ * is in a sequential pattern.
+ */
+static bool
+index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
+{
+ int idx;
+
+ /* If the queue is empty, just store the block and we're done. */
+ if (prefetch->blockIndex == 0)
+ {
+ prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+ prefetch->blockIndex++;
+ return false;
+ }
+
+ /*
+ * Otherwise, check if it's the same as the immediately preceding block (we
+ * don't want to prefetch the same block over and over).
+ */
+ if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
+ return true;
+
+ /* Not the same block, so add it to the queue. */
+ prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+ prefetch->blockIndex++;
+
+ /* check for a sequential pattern over the last couple of requests */
+ for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
+ {
+ /* not enough requests to confirm a sequential pattern */
+ if (prefetch->blockIndex <= i)
+ return false;
+
+ /*
+ * index of the already requested buffer (-1 because we already
+ * incremented the index when adding the block to the queue)
+ */
+ idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
+
+ /* does this earlier block break the expected sequential pattern? */
+ if (prefetch->blockItems[idx] != (block - i))
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * index_prefetch_add_cache
+ * Add a block to the cache, return true if it was recently prefetched.
+ *
+ * When checking a block, we need to check if it was recently prefetched,
+ * where recently means within PREFETCH_CACHE_SIZE requests. This check
+ * needs to be very cheap, even with fairly large caches (hundreds of
+ * entries). The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also
+ * need to expire entries, so that only "recent" requests are remembered.
+ *
+ * A queue would allow expiring the requests, but checking if a block was
+ * prefetched would be expensive (linear search for longer queues). Another
+ * option would be a hash table, but that has issues with expiring entries
+ * cheaply (which usually degrades the hash table).
+ *
+ * So we use a cache that is organized as multiple small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table), and each LRU is tiny (e.g. 8 entries). The LRU only keeps
+ * the most recent requests (for that particular LRU).
+ *
+ * This allows quick searches and expiration, with false negatives (when
+ * a particular LRU has too many collisions).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * prefetch requests in total.
+ *
+ * The recency is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint64, so it should
+ * not wrap (125 zebibytes, would take ~4 million years at 1GB/s).
+ *
+ * To check if a block was prefetched recently, we calculate hash(block),
+ * and then linearly search the tiny LRU for an entry with the same block
+ * and a request number less than PREFETCH_CACHE_SIZE requests ago.
+ *
+ * At the same time, we either update the entry (for the same block) if
+ * found, or replace the oldest/empty entry.
+ *
+ * If the block was not recently prefetched (i.e. we want to prefetch it),
+ * we increment the counter.
+ */
+static bool
+index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
+{
+ PrefetchCacheEntry *entry;
+
+ /* calculate which LRU to use */
+ int lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+
+ /* entry to (maybe) use for this block request */
+ uint64 oldestRequest = PG_UINT64_MAX;
+ int oldestIndex = -1;
+
+ /*
+ * First add the block to the (tiny) top-level LRU cache and see if it's
+ * part of a sequential pattern. In this case we just ignore the block
+ * and don't prefetch it - we expect read-ahead to do a better job.
+ *
+ * XXX Maybe we should still add the block to the main (LRU) cache, in case
+ * we happen to access it later? That might help if we first scan a lot
+ * of the table sequentially, and then randomly. Not sure that's very
+ * likely with index access, though.
+ */
+ if (index_prefetch_is_sequential(prefetch, block))
+ {
+ prefetch->countSkipSequential++;
+ return true;
+ }
+
+ /* see if we already have prefetched this block (linear search of LRU) */
+ for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+ {
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+ /* Is this the oldest prefetch request in this LRU? */
+ if (entry->request < oldestRequest)
+ {
+ oldestRequest = entry->request;
+ oldestIndex = i;
+ }
+
+ /* Request numbers are positive, so 0 means "unused". */
+ if (entry->request == 0)
+ continue;
+
+ /* Is this entry for the same block as the current request? */
+ if (entry->block == block)
+ {
+ bool prefetched;
+
+ /*
+ * Is the old request sufficiently recent? If yes, we treat the
+ * block as already prefetched.
+ *
+ * XXX We do add the cache size to the request in order not to
+ * have issues with uint64 underflows.
+ */
+ prefetched = (entry->request + PREFETCH_CACHE_SIZE >= prefetch->prefetchReqNumber);
+
+ /* Update the request number. */
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ prefetch->countSkipCached += (prefetched) ? 1 : 0;
+
+ return prefetched;
+ }
+ }
+
+ /*
+ * We didn't find the block in the LRU, so store it either in an empty
+ * entry, or in the "oldest" prefetch request in this LRU.
+ */
+ Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
+
+ entry->block = block;
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ /* not in the prefetch cache */
+ return false;
+}
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+static void
+index_prefetch(IndexScanDesc scan, ItemPointer tid)
+{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+ BlockNumber block;
+
+ /*
+ * No heap relation means bitmap index scan, which does prefetching at
+ * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+ * without the heap)
+ *
+ * XXX But in this case we should have prefetchMaxTarget=0, because in
+ * index_beginscan_bitmap() we disable prefetching. So maybe we should
+ * just check that.
+ */
+ if (!prefetch)
+ return;
+
+ /* was it initialized correctly? */
+ // Assert(prefetch->prefetchIndex != -1);
+
+ /*
+ * If we got here, prefetching is enabled and it's a node that supports
+ * prefetching (i.e. it can't be a bitmap index scan).
+ */
+ Assert(scan->heapRelation);
+
+ prefetch->countAll++;
+
+ block = ItemPointerGetBlockNumber(tid);
+
+ /*
+ * Do not prefetch the same block over and over again.
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ */
+ if (!index_prefetch_add_cache(prefetch, block))
+ {
+ prefetch->countPrefetch++;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+ }
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8570b14f62..6ae445d62c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
!INSTR_TIME_IS_ZERO(usage->blk_write_time));
bool has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+ bool has_prefetches = (usage->blks_prefetches > 0);
bool show_planning = (planning && (has_shared ||
has_local || has_temp || has_timing ||
has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
appendStringInfoChar(es->str, '\n');
}
+ /* As above, show only positive counter values. */
+ if (has_prefetches)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Prefetches:");
+
+ if (usage->blks_prefetches > 0)
+ appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+ if (usage->blks_prefetch_rounds > 0)
+ appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+
if (show_planning)
es->indent--;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b89..e5ce1dbc95 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* May have to restart scan from this point if a potential conflict is
* found.
+ *
+ * XXX Should this do index prefetch? Probably not worth it for unique
+ * constraints, I guess? Otherwise we should calculate prefetch_target
+ * just like in nodeIndexscan etc.
*/
retry:
conflict = false;
found_self = false;
- index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+ index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index e776524227..c0bb732658 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -204,8 +204,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+ /*
+ * Start an index scan.
+ *
+ * XXX Should this do index prefetching? We're looking for a single tuple,
+ * probably using a PK / UNIQUE index, so it does not seem worth it. If we
+ * reconsider this, calculate prefetch_target like in nodeIndexscan.
+ */
+ scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
retry:
found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d..434be59fca 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
dst->local_blks_written += add->local_blks_written;
dst->temp_blks_read += add->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+ dst->blks_prefetches += add->blks_prefetches;
INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b969..3ecb8470d4 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We reach here if the index only scan is not parallel, or if we're
* serially executing an index only scan that was planned to be
* parallel.
+ *
+ * XXX Maybe we should enable prefetching, but prefetch only pages that
+ * are not all-visible (but checking that from the index code seems like
+ * a violation of layering etc).
+ *
+ * XXX This might lead to IOS being slower than plain index scan, if the
+ * table has a lot of pages that need recheck.
*/
scandesc = index_beginscan(node->ss.ss_currentRelation,
node->ioss_RelationDesc,
estate->es_snapshot,
node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ node->ioss_NumOrderByKeys,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc = scandesc;
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -719,7 +728,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d..71ae6a47ce 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
/*
* When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ Relation heapRel = node->ss.ss_currentRelation;
/*
* extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
if (scandesc == NULL)
{
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
if (scandesc == NULL)
{
+ Relation heapRel = node->ss.ss_currentRelation;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -1678,6 +1717,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Maybe reduce the value with parallel workers?
+ */
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1690,7 +1744,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1726,6 +1782,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->iss_ScanDesc =
@@ -1733,7 +1797,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d27ef2985d..d65575fd10 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1131,6 +1131,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
need_full_snapshot = true;
}
+ elog(LOG, "slot = %s need_full_snapshot = %d", cmd->slotname, need_full_snapshot);
+
ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
InvalidXLogRecPtr,
XL_ROUTINE(.page_read = logical_read_xlog_page,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076e..0b02b6265d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
index_scan = index_beginscan(heapRel, indexRel,
&SnapshotNonVacuumable,
- 1, 0);
+ 1, 0, 0, 0); /* XXX maybe do prefetch? */
/* Set it up for index-only scan */
index_scan->xs_want_itup = true;
index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a308795665..f3efffc4a8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -17,6 +17,7 @@
#include "access/sdir.h"
#include "access/skey.h"
#include "nodes/tidbitmap.h"
+#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -152,7 +153,9 @@ extern bool index_insert(Relation indexRelation,
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys);
+ int nkeys, int norderbys,
+ int prefetch_target,
+ int prefetch_reset);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
@@ -169,7 +172,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
extern void index_parallelrescan(IndexScanDesc scan);
extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
Relation indexrel, int nkeys, int norderbys,
- ParallelIndexScanDesc pscan);
+ ParallelIndexScanDesc pscan,
+ int prefetch_target,
+ int prefetch_reset);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
struct TupleTableSlot;
@@ -230,4 +235,108 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
ScanDirection direction);
extern void systable_endscan_ordered(SysScanDesc sysscan);
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int *start, int *end,
+ bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int index);
+
+/*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches. Doesn't need to be perfectly accurate, but we
+ * aim to make false positives/negatives reasonably low.
+ */
+typedef struct PrefetchCacheEntry {
+ BlockNumber block;
+ uint64 request;
+} PrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - shouldn't be too
+ * small or too large. 1024 seems about right, it covers ~8MB of data.
+ * It's somewhat arbitrary, there's no particular formula saying it
+ * should not be higher/lower.
+ *
+ * The cache is structured as an array of small LRU caches, so the total
+ * size needs to be a multiple of LRU size. The LRU should be tiny to
+ * keep linear search cheap enough.
+ *
+ * XXX Maybe we could consider effective_cache_size or something?
+ */
+#define PREFETCH_LRU_SIZE 8
+#define PREFETCH_LRU_COUNT 128
+#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
+/*
+ * Used to detect sequential patterns (and disable prefetching).
+ */
+#define PREFETCH_QUEUE_HISTORY 8
+#define PREFETCH_SEQ_PATTERN_BLOCKS 4
+
+
+typedef struct IndexPrefetchData
+{
+ /*
+ * XXX We need to disable this in some cases (e.g. when using index-only
+ * scans, we don't want to prefetch pages). Or maybe we should prefetch
+ * only pages that are not all-visible, that'd be even better.
+ */
+ int prefetchTarget; /* how far we should be prefetching */
+ int prefetchMaxTarget; /* maximum prefetching distance */
+ int prefetchReset; /* reset to this distance on rescan */
+ bool prefetchDone; /* did we get all TIDs from the index? */
+
+ /* runtime statistics */
+ uint64 countAll; /* all prefetch requests */
+ uint64 countPrefetch; /* actual prefetches */
+ uint64 countSkipSequential;
+ uint64 countSkipCached;
+
+ /*
+ * Queue of TIDs to prefetch.
+ *
+ * XXX Sizing for MAX_IO_CONCURRENCY may be overkill, but it seems simpler
+ * than dynamically adjusting for custom values.
+ */
+ ItemPointerData queueItems[MAX_IO_CONCURRENCY];
+ uint64 queueIndex; /* next TID to prefetch */
+ uint64 queueStart; /* first valid TID in queue */
+ uint64 queueEnd; /* first invalid (empty) TID in queue */
+
+ /*
+ * The last couple of prefetched blocks, used to check for certain access
+ * patterns and skip prefetching (e.g. for sequential access).
+ *
+ * XXX Separate from the main queue, because we only want to compare the
+ * block numbers, not the whole TID. In sequential access it's likely we
+ * read many items from each page, and we don't want to check many items
+ * (as that is much more expensive).
+ */
+ BlockNumber blockItems[PREFETCH_QUEUE_HISTORY];
+ uint64 blockIndex; /* index into blockItems (points to the first
+ * empty entry) */
+
+ /*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches.
+ */
+ uint64 prefetchReqNumber;
+ PrefetchCacheEntry prefetchCache[PREFETCH_CACHE_SIZE];
+
+} IndexPrefetchData;
+
+#define PREFETCH_QUEUE_INDEX(a) ((a) % (MAX_IO_CONCURRENCY))
+#define PREFETCH_QUEUE_EMPTY(p) ((p)->queueEnd == (p)->queueIndex)
+#define PREFETCH_ENABLED(p) ((p) && ((p)->prefetchMaxTarget > 0))
+#define PREFETCH_FULL(p) ((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
+#define PREFETCH_DONE(p) ((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+#define PREFETCH_ACTIVE(p) (PREFETCH_ENABLED(p) && !(p)->prefetchDone)
+#define PREFETCH_BLOCK_INDEX(v) ((v) % PREFETCH_QUEUE_HISTORY)
+
#endif /* GENAM_H */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac0..c119fe597d 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
Relation rel;
} IndexFetchTableData;
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
bool *xs_orderbynulls;
bool xs_recheckorderby;
+ /* prefetching state (or NULL if disabled) */
+ IndexPrefetchData *xs_prefetch;
+
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
} IndexScanDescData;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183b..97dd3c2c42 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+ int64 blks_prefetch_rounds; /* # of prefetch rounds */
+ int64 blks_prefetches; /* # of buffers prefetched */
instr_time blk_read_time; /* time spent reading blocks */
instr_time blk_write_time; /* time spent writing blocks */
instr_time temp_blk_read_time; /* time spent reading temp blocks */
Hi,
Attached is a v6 of the patch, which rebases v5 (just some minor
bitrot), and also does a couple of changes which I kept in separate patches
to make it obvious what changed.
0001-v5-20231016.patch
----------------------
Rebase to current master.
0002-comments-and-minor-cleanup-20231012.patch
----------------------------------------------
Various comment improvements (remove obsolete ones, clarify a bunch of
other comments, etc.). I tried to explain the reasoning why some places
disable prefetching (e.g. in catalogs, replication, ...), explain how
the caching / LRU works etc.
0003-remove-prefetch_reset-20231016.patch
-----------------------------------------
I decided to remove the separate prefetch_reset parameter, so that all
the index_beginscan() methods only take a parameter specifying the
maximum prefetch target. The reset was added early on, when the prefetch
happened much lower in the AM code, at the index page level, and the
reset happened when moving to the next index page. Now that the prefetch
has moved to the executor, this doesn't make much sense - the resets happen
on rescans, and it seems right to just reset to 0 (just like for bitmap
heap scans).
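For illustration, a call site after this change might look roughly like
this (just a sketch - "prefetch_max" is an illustrative name, the point
being that callers now pass only the maximum prefetch distance):

    /* sketch: no separate prefetch_reset anymore, rescan simply resets to 0 */
    prefetch_max = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);

    scandesc = index_beginscan(node->ss.ss_currentRelation,
                               node->iss_RelationDesc,
                               estate->es_snapshot,
                               node->iss_NumScanKeys,
                               node->iss_NumOrderByKeys,
                               prefetch_max);   /* was: prefetch_target, prefetch_reset */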
0004-PoC-prefetch-for-IOS-20231016.patch
----------------------------------------
This is a PoC adding the prefetch to index-only scans too. At first that
may seem rather strange, considering eliminating the heap fetches is the
whole point of IOS. But if the pages are not marked as all-visible (say,
the most recent part of the table), we may still have to fetch them. In
which case it'd be easy to see cases where IOS is slower than a regular
index scan (with prefetching).
The code is quite rough. It adds a separate index_getnext_tid_prefetch()
function, adding prefetching on top of index_getnext_tid(). I'm not sure
it's the right pattern, but it's pretty much what index_getnext_slot()
does too, except that it also does the fetch + store to the slot.
Note: There's a second patch adding index-only filters, which requires
switching the regular index scans from index_getnext_slot() to _tid() too.
The prefetching then happens only after checking the visibility map (if
requested). This part definitely needs improvements - for example
there's no attempt to reuse the VM buffer, which I guess might be expensive.
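To make that VM check concrete, the core idea is roughly the following
(a sketch only, not the exact shape of the 0004 code - "vmbuffer" is an
illustrative local variable):

    /*
     * Prefetch the heap block only if it's not all-visible - for
     * all-visible pages IOS won't fetch the heap tuple, so prefetching
     * them would be wasted work.
     */
    block = ItemPointerGetBlockNumber(tid);

    if (!VM_ALL_VISIBLE(scan->heapRelation, block, &vmbuffer))
        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);

    /*
     * XXX vmbuffer should be kept and reused across calls - re-reading
     * the VM page for every TID is likely the expensive part mentioned
     * above.
     */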
index-prefetch.pdf
------------------
Attached is also a PDF with results of the same benchmark I did before,
comparing master vs. patched with various data patterns and scan types.
It's not 100% comparable to earlier results as I only ran it on a
laptop, and it's a bit noisier too. The overall behavior and conclusions
are however the same.
I was specifically interested in the IOS behavior, so I added two more
cases to test - indexonlyscan and indexonlyscan-clean. The first is the
worst-case scenario, with no pages marked as all-visible in VM (the test
simply deletes the VM), while indexonlyscan-clean is the good-case (no
heap fetches needed).
The results mostly match the expected behavior, particularly for the
uncached runs (when the data is expected to not be in memory):
* indexonlyscan (i.e. bad case) - About the same results as
"indexscans", with the same speedups etc. Which is a good thing
(i.e. IOS is not unexpectedly slower than regular indexscans).
* indexonlyscan-clean (i.e. good case) - Seems to have mostly the same
performance as without the prefetching, except for the low-cardinality
runs with many rows per key. I haven't checked what's causing this,
but I'd bet it's the extra buffer lookups/management I mentioned.
I noticed there's another prefetching-related patch [1] from Thomas
Munro. I haven't looked at it yet, so hard to say how much it interferes
with this patch. But the idea looks interesting.
[1]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
index-prefetch.pdf (application/pdf)